CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. § 119 to U.S. Provisional
Application Nos. 60/271,406, entitled "Systematic Discovery of New Genes," filed
February 27, 2001, and 60/333,726, entitled "Systematic Discovery of New Genes and
Genes Discovered Thereby," filed November 29, 2001, the entire contents of
which are hereby incorporated by reference.

BACKGROUND OF THE INVENTION 

The genomes of organisms are large stretches of DNA. In many organisms,
the function of a great part of the genome is unknown since it does not contain
encoded genes. Because of advances in computerization, genomic sequences are being
deposited in public databases at a dramatic rate. However, this information will be of
little value to biologists if the tools to manage and interpret the information are
unavailable or unreliable.

Today's scientists use advanced quantitative analysis and database
comparisons to better manage the genetic information, and to identify and define the
relationship between sequences and the corresponding phenotypes. Increasingly,
molecular genetics is shifting from the laboratory to the computer. However, the
process of detecting genes in these sequences is still relatively slow.

One promising use of bioinformatics to increase the efficiency of research
involves studying a genome to determine its sequence and its relationship to other
sequences and genes in that genome and in other organisms. This information is of
significant interest to pharmaceutical and biomedical research to, for example, assist
in the evaluation of drug efficacy and resistance. Genetic databases for organisms
such as Saccharomyces cerevisiae, Escherichia coli and Mycoplasma pneumoniae are
publicly available, but the ability to manipulate this data is limited. To make the
manipulation of genomic information easier, sophisticated databases and search
programs have been developed.

Some well-known databases of genetic information include GenBank™,
SwissProt and OMIM™ (Online Mendelian Inheritance in Man). GenBank™ is the
National Institutes of Health (NIH) genetic sequence database, an annotated collection
of all publicly available DNA sequences (Nucl. Acids Res. (2000) 28:15-8). There were
approximately 10,336,000,000 bases in 9,103,000 sequence records as of October
2000 (see www.ncbi.nlm.nih.gov/Genbank/). GenBank™ is part of the International
Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of
Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and
GenBank™ at the NIH.

SwissProt is an annotated protein sequence database established in 1986 and
maintained collaboratively by the Swiss Institute for Bioinformatics (SIB) and the
European Bioinformatics Institute (EBI).

OMIM™ is a database catalog (www.ncbi.nlm.nih.gov/OMIM/) of human 
genes and genetic disorders authored and edited by scientists at The Johns Hopkins 
University. The database contains textual information and references, as well as links 
to MEDLINE and sequence records. 

The Entrez retrieval system, run by the National Center for Biotechnology
Information (NCBI) at the NIH, can search several linked databases at a time. Entrez
can search biomedical literature databases, GenBank™, SwissProt and other protein
databases, three-dimensional macromolecular structures and OMIM. Searches can
produce results in the form of related sequences and structural neighbors.

A popular search algorithm is BLAST (Basic Local Alignment
Search Tool). BLAST is a set of similarity search programs designed to explore all of
the available sequence databases regardless of whether the query is protein or DNA.
The BLAST programs have been designed for speed, with a minimal sacrifice of
sensitivity to distant sequence relationships. The scores assigned by a BLAST search
have a well-defined statistical interpretation, making real matches easier to distinguish
from random background hits. BLAST uses a heuristic algorithm which seeks local
as opposed to global alignments and is therefore able to detect relationships among
sequences which share only isolated regions of similarity (Altschul, S.F. et al. (1990)
"Methods for assessing the statistical significance of molecular sequence features by
using general scoring schemes," Proc. Natl. Acad. Sci. USA, 87: 2264-2268).

Despite the strong computational biomolecular databases and search engines
currently available, manual evaluation of the data produced is often required.
Biological macromolecules exhibit many non-random features, most notably
repetitive sequences and non-coding introns of genomic DNA. These typically
require extensive evaluation of database matches that are found, which is a subjective,
error-prone and tedious process. Present computational biology methods used to
determine the number of coding sequences include promoter studies (Rainer, N. et al.
(1999) Yeast 15:1775), codon usage (Staden, R. and McLachlan, A.D. (1982) Nucl.
Acids Res. 10: 141), or some combination of these methods. These procedures are
based on current knowledge of gene function, and have a number of limitations.

In addition, there is evidence that the current computational methods for
assessing coding potential often fail to identify open reading frames (ORFs) that are
discovered through experimental and other non-computational methods. While
sequence similarity search programs are a quick and versatile tool, frequently able to
identify putative coding regions, the accuracy of the present methods is often
compromised by factors such as differential and tissue-specific splicing, genes within
genes (i.e., polycistronic coding domains) and the need for species-specific
parameters. From a statistical standpoint, the accuracy of known methods is
extremely dependent on the choice of scoring system, statistical significance of
alignments, sequence redundancy and the masking of confounding sequence regions.

For example, Serial Analysis of Gene Expression, or SAGE, is a technique
designed to take advantage of high-throughput sequencing technology to obtain a
profile of cellular gene expression. Essentially, the SAGE technique does not measure
the expression level of a gene directly, but quantifies a "tag", which represents the
transcription product of a gene. A SAGE tag is a nucleotide sequence of a defined
length, directly 3'-adjacent to the 3'-most restriction site for a particular restriction
enzyme. The data product of the SAGE technique is a list of tags with their
corresponding count values, and thus is a digital representation of cellular gene
expression. However, to increase throughput, the SAGE method often sacrifices
accuracy and fidelity both in the assignment of tags to genes and in the ability to
quantify a gene's expression level.
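
The tag definition above can be illustrated with a minimal Python sketch. It is not part of the SAGE protocol as described here: the recognition site `CATG`, the 10-base tag length, and the function names `sage_tag` and `count_tags` are assumptions chosen only for illustration.

```python
from collections import Counter

def sage_tag(transcript, site="CATG", tag_len=10):
    """Return the tag directly 3'-adjacent to the 3'-most occurrence of `site`.

    `site` and `tag_len` are illustrative assumptions, not values from the text.
    Returns None if the site is absent or fewer than `tag_len` bases follow it.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)              # 3'-most occurrence of the restriction site
    if pos == -1:
        return None
    start = pos + len(site)            # first base 3' of the site
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

def count_tags(transcripts, site="CATG", tag_len=10):
    """Build the list of tags with count values (the 'digital' expression profile)."""
    tags = (sage_tag(t, site, tag_len) for t in transcripts)
    return Counter(t for t in tags if t is not None)
```

Because two different transcripts can yield the same short tag, a count table built this way cannot always assign a tag to a single gene, which is one source of the ambiguity noted above.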

The need for an in silico (i.e., computational) method to identify new coding
genes with the speed and versatility of the presently known methods, but with
increased accuracy and lack of bias, is increasing exponentially in conjunction with
the increasing accumulation of known sequences.

In addition to accurate methods, it is also important to have a model that lends
itself well to research. In attempts to sequence and annotate the human genome,
scientists have turned to the genomes of other organisms to use as models. One
genome often used is that of the single-cell eukaryote
Saccharomyces cerevisiae (baker's yeast). Saccharomyces is amenable to genetic and
biochemical manipulations, and many processes that occur in yeast also occur in
larger eukaryotes, making yeast a model system for the study of eukaryotes, including
humans. The genome of the yeast model system Saccharomyces cerevisiae was the very
first eukaryotic genome to be completely sequenced (Goffeau, A. et al. (1996) Science
274:546) and is the subject of intensive research. The current consensus suggests that the
number of yeast genes that are 100 amino acids or longer is in the range of 6000
(Goffeau (1996); Mewes, H.W. et al. (1997) Nature 387(6632 Suppl):7; and
Winzeler, E. A. and Davis, R.W. (1997) Curr. Opin. Genet. Dev. 7:771), excluding a
subset of small ORFs (Basrai, M.A. et al. (1999) Mol. Cell. Biol. 19:7041; and
Velculescu, V. E. et al. (1997) Cell 88:243). Recent genetic studies designed to
catalog all genome transcripts, using SAGE technology (Velculescu, V. E. et al.
(1997)) and the analysis of a collection of transposon insertions (Ross-Macdonald, P.
et al. (1999) Nature 402:413), have discovered new ORFs, which were not previously
identified in silico. This pool of novel genes includes some putative proteins that are
optimally shorter than 100 amino acids. However, determination of ORFs encoding
polypeptides greater than 100 amino acids is also contemplated using the methods
described herein.

SUMMARY OF THE INVENTION 

This invention relates to a systematic in silico method to identify new coding
sequences, including homologs of coding sequences, in S. cerevisiae and other
organisms. The method of the present invention compares ORFs of a first organism
to a comprehensive database of sequences from related organisms to identify
homologs. The results of this method using comprehensive database searches and
experimental studies suggest that the number of coding genes in, for example,
S. cerevisiae, is substantially higher than currently believed.

Another embodiment of the present invention comprises a method comprising
the following steps:

(A) collecting genomic sequence of the first organism;

(B) identifying stop-to-stop ORFs of the first organism;

(C) translating the stop-to-stop ORFs into polypeptide sequences;

(D) comparing the polypeptide sequences of the first organism to amino acid
translations of genomic libraries comprising genomes of other organisms; and

(E) identifying, based on sequence identity, ORFs of the first organism that are
present in the other organisms, wherein the identified ORFs are coding ORFs.

The ORFs are typically determined using the start codon AUG and stop codons UAA,
UAG and UGA. However, the method also contemplates genome analysis with the
less conventional start and stop codons discussed infra.

In one embodiment, the method comprises using BLAST with a p-value of
less than 1. In another embodiment, FASTA is used, preferably with settings
equivalent to those for BLAST with a p-value of less than 1.

In another embodiment, the invention comprises a method of identifying
ORFs in a genome of a first organism comprising the steps of: (A) collecting genomic
sequence of the first organism; (B) comparing the genomic sequence of the first
organism to one or more other genomic libraries comprising genomes of other
organisms containing ORFs; and (C) determining ORFs for the first organism based
on the comparison. The ORFs of step (B) are ORFs that have previously been
described.

The nucleic acid and amino acid sequences of the organism being studied may
have at least about 20%, more preferably at least 25%, and most preferably at least 30%
sequence identity to known sequences.

The algorithm used would provide results equivalent to those obtained using
BLAST wherein the p-value is less than 1.

The database may be a database of nucleotide sequences from a species related
to the organism (e.g., S. cerevisiae and S. pombe) and a database of eukaryotic or
prokaryotic nucleotide sequences. Specifically, the organism source of the eukaryotic
nucleotide sequences may include, but is not limited to, primate, equine, bovine,
caprine, ovine, porcine, feline, canine, lupine, camelid, cervidae, rodent, avian and
ichthyes. The primate may be a human. Other organisms include vertebrates (e.g.,
mammals, birds, fish, and reptiles), invertebrates (e.g., worms), and plants.

In another embodiment, the organism can be a fungus of the phylum
oomycota, chytridiomycota, zygomycota, ascomycota, basidiomycota or
deuteromycota. Preferably, the fungus is yeast of the phylum ascomycota. More
preferably, the yeast is of the genus Saccharomyces or Schizosaccharomyces. Most
preferably, the yeast is the species S. cerevisiae or S. pombe.

The long genes are preferably about 100 or more amino acids in length. The
smORFs preferably are less than about 100 amino acids; however, they can include
polypeptides longer than 100 amino acids.

The smORFs isolated as described herein can be utilized in, for example, a
microarray. For instance, a nucleic acid microarray is fabricated by high-speed
robotics, generally on glass but sometimes on nylon or silicon substrates, for which
probes with known identity are used to determine complementary binding. These
arrays permit massively parallel gene expression and gene discovery studies. This
technology allows researchers to monitor the whole genome on a single chip so that
they have a better picture of the interactions among the thousands of genes
simultaneously.

The present invention relates to a smORF identified using the methods of the
present invention, as well as a vector comprising the smORF and a cell comprising
the vector. The cell preferably expresses the polypeptide encoded by the smORF.
Further, the present invention relates to a nucleic acid that hybridizes to the sense or
the antisense strand of the smORF, as well as an isolated polypeptide encoded by the
smORF.

This invention also relates to 119 novel coding sequences (SEQ ID NOS: 1-119)
from the S. cerevisiae genome discovered using the methods of the instant
invention, or fragments thereof, and optionally, a sequence required for an
amplification reaction. The fragment may be a primer. The invention further relates
to an isolated polypeptide selected from the group consisting of SEQ ID NOS: 674-
1346 and preferably SEQ ID NOS: 674-792, which appear to be expressed and, in
some instances, essential. The polypeptides should comprise at least 5 or 10 or more
contiguous amino acids of these sequences.

The present invention also relates to methods of modulating the genes and
gene products identified using an in silico method described herein and identifying
such modulating agents. Preferred modulating agents include antibiotics, antifungals
and antisense agents. A modulating agent is generally a compound or composition
that modulates the biological activity of a gene, its transcript or the protein(s) encoded
by that gene.

In another embodiment, the polypeptide or biologically active fragment 
thereof is in the form of a composition with a pharmaceutically acceptable carrier or 
excipient. 

The present invention further relates to antibodies and immunologically active
fragments thereof that recognize and bind to a smORF polypeptide or fragment
thereof. These antibodies can be human antibodies, humanized or primatized®
antibodies, monoclonal antibodies or bispecific antibodies. A further embodiment of
the invention includes immunologically active fragments of the antibodies, such as
Fab, Fab', F(ab')2, Fv, scFv, and Fd.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 outlines the first steps of the strategy for new smORF identification
using computational methods to identify new ORFs not identified by conventional
methods.

Figures 2A-2E show the experimental validation of the S. cerevisiae smORFs.
Fig. 2A shows the control experiments demonstrating that the RNA used for the
RT-PCR experiment was not contaminated with genomic DNA. Fig. 2B shows the
principle behind and the results of orientation-specific RT-PCR, thus demonstrating
that the transcripts observed originate from the predicted DNA strand. Figs. 2D and
2E show more examples of transcripts detected from the smORFs.

Figure 3 shows three yeast smORFs, which have highly conserved homologs
in other fungi, and illustrates that two have highly conserved homologs in mammalian
species. Figure 3 shows the multiple sequence alignment of smORF18 (SEQ ID NO:
677) and its homologs, smORF139 (SEQ ID NO: 709) and its homologs,
and smORF570 (SEQ ID NO: 769) and its homologs. Abbreviations: Dm, Drosophila
melanogaster; Hs, Homo sapiens; Ce, Caenorhabditis elegans; Sc, Saccharomyces
cerevisiae; Ca, Candida albicans; Af, Aspergillus fumigatus; An, Aspergillus
nidulans; Sp, Schizosaccharomyces pombe; Bt, Bos taurus; and Mm, Mus musculus.
Residues that are identical or similar in all protein homologs are shaded in black and
those identical or similar in two or more, but not all proteins in the alignment are
shaded in gray. Homology shading was done with GeneDoc (Nicholas, K. B., et al.
(1997), EMBnet News 4: 14).

Figure 4 shows experimental evidence that smORF18 (SEQ ID NO: 4) codes
for a polypeptide of the expected size. A triple HA-tag was fused to the C-terminal
end of smORF18 using PCR, and the wild-type smORF18 gene was replaced by the
tagged smORF18 gene by allele replacement into the chromosome. Soluble extracts
were prepared and analyzed by Western blot analysis using monoclonal antibodies
that recognize the HA epitope. Shown are extracts from wild-type cells (lane 2) and
extracts from two separate isolates carrying the HA-tagged smORF18 (lanes 3 and 4).

Figure 5. Human smORF18 homolog complementation of the temperature
sensitive (ts) phenotype of the smorf18Δ strain. A yeast strain with a deleted
smORF18 (smorfΔ) was transformed with plasmids carrying the wild-type yeast
smORF18 (SEQ ID NO: 4), or the human smORF18 ORF under the control of the
GAL1 promoter, or empty vector. Transformants were then plated at 30°C and 37°C.

Figure 6. Diagram of smORF57 protein interaction map. The arrows indicate 
the orientation of each two-hybrid interaction. 



DETAILED DESCRIPTION OF THE INVENTION



I. Definitions 

As used herein, the term "gene" refers to the fundamental physical and
functional unit of heredity, which carries information from one generation to the next.
A gene is a segment of DNA composed of a transcribed region and regulatory
sequences that make possible transcription of the DNA.

As used herein, the term "organism" refers to eukaryotes and prokaryotes. 

As used herein the term "known sequence" refers to a sequence (e.g., nucleic 
acid or amino acid) of any type publicly available and annotated. 

As used herein, the term "long gene" refers to a gene that encodes a
polypeptide of about 100 amino acids or more. Long genes can include genes
encoding a polypeptide that is 100, 110, 120, 130, 140, 150, 175, 200, 300, 400, 500,
600, 750 and 1000 amino acids long or greater.

As used herein, the term "homolog" refers to a gene and protein coded thereby
from one species with similarities to another gene and its encoded protein of the same
species or among different species. These similarities can be based on structural (e.g.,
sequence similarity and/or three-dimensional commonality) and/or functional
similarities (e.g., enzymatic and/or biochemical activity).

As used herein, the term "ortholog" refers to a gene and protein encoded
thereby from one species which corresponds to a gene and its associated protein in
another species that is related via a common ancestral species (a homologous gene),
but which has evolved to become different from the gene of the other species.

As used herein, the term "ORF" refers to an open reading frame, which
corresponds to a nucleotide sequence that could potentially be translated into a
polypeptide. For the purposes of this application, an ORF may be any part of a
coding sequence, with or without stop codons. An ORF is usually not considered to
be an equivalent to a gene locus until an mRNA transcript for a gene product is
generated. The gene product can be detected and/or the ORF's protein product has
been identified.

As used herein, the term "smORF" preferably refers to a small open reading
frame that encodes a polypeptide of less than 100 amino acids. However, the methods
described herein can also be used to identify ORFs which encode polypeptides
more than 100 amino acids long (e.g., 100, 125, 150, 200, 300, 400, 500, etc. amino
acids long). smORFs may encode a polypeptide of at least 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 and
100 amino acids. Preferably, smORFs encode polypeptides of 17 or 18 to 100 amino
acids long. The nucleic acids encoding these polypeptides accordingly include
nucleic acids that are 15 to 300 nucleotides in length or any number of nucleotides
within that range. The nucleic acid can be any that encodes the identified smORF
protein, including synthetic nucleic acids and the wild-type nucleic acid. Preferred
nucleic acids will have at least 8 contiguous nucleotides. However, other nucleic
acids may have from 8 to 300 or more contiguous nucleotides, or any number lying
within that range (e.g., 25, 75, and the like).

As used herein, "annotation" refers to the description of the properties of a 
given sequence or gene, such as the protein encoded by the gene, function of the 
protein, its domain structure, post-translational modifications, variants, etc. 

As used herein, the term "in silico" refers to a computational method of
analyzing nucleic acid and/or amino acid sequences.

As used herein, the term "sequence identity" refers to the relatedness of two 
genetic sequences, as represented by the percentage of the amino acids and/or 
nucleotides they share. 
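
As a hedged illustration of this definition, the following Python sketch computes percent identity over two sequences that have already been aligned to equal length. The choice of denominator (all alignment columns, gaps included) and the function name `percent_identity` are assumptions; other conventions, such as dividing by the shorter sequence length, give different values.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both sequences.

    Assumes the inputs are already aligned (equal length, gaps as `gap`).
    The denominator is the full alignment length, which is one convention
    among several and is an assumption of this sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)     # gap-gap columns do not count
    return 100.0 * matches / len(aligned_a)
```

For example, the aligned pair `MKT-LV` and `MRTALV` shares four of six columns, which is about 66.7% identity under this convention.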

As used herein, the term "sequence homology" defines regions of DNA
sequence, which are the same at different locations of the genome, or between
different DNA molecules such as between the genome and a plasmid or DNA
fragment.

As used herein, the term "microarray" (also referred to as "biochip" and "DNA
chip") refers to a microarray comprising nucleic acids. A microarray is fabricated by
high-speed robotics, generally on glass but sometimes on nylon or silicon substrates,
for which probes with known identity are used to determine complementary binding,
thus allowing parallel gene expression and gene discovery studies. This technology
allows researchers to monitor the whole genome on a single chip so that they have a
better picture of the interactions among the thousands of genes simultaneously.

As used herein, the term "fragment thereof" refers to an incomplete and/or
spliced section of the smORFs of the present invention. By "biologically active" is
meant that portion of the smORF that retains biological activity. For example, for a
nucleic acid, it might be the activity of binding to a cognate strand. With reference to
a polypeptide, by biologically active is meant that portion which is, for example,
immunogenic or has an antigenic epitope, or that has enzymatic activity.

As used herein, the term "false positives" refers to a test result, which
erroneously assigns the test subject to a specific group, due to insufficiently exact
methods of testing.

As used herein, the term "false negatives" refers to a test result, which
erroneously excludes the test subject from a specific group, due to insufficiently exact
methods of testing.

As used herein, the term "hits" refers to when a database/computer reviews the
information cache stored therein and finds data meeting the chosen parameters; the
result is called a "hit."

As used herein, the term "ESTs" ("expressed sequence tags") refers to a short
strand of DNA, which is part of a cDNA. Because an EST is usually unique to a
particular cDNA, and because cDNAs correspond to a particular gene in the genome,
ESTs can be used to help identify unknown genes and to map their position in the
genome.

As used herein, the term "RT-PCR" refers to reverse transcriptase-polymerase
chain reaction. In this process, mRNA is subjected to reverse transcriptase, resulting
in the production of cDNA complementary to the mRNA. Large amounts of selected
cDNA can then be produced by means of the polymerase chain reaction.

As used herein, the term "database" refers to a large collection of genetic data
organized especially for rapid search and retrieval by computer.

As used herein, the term "algorithm" refers to a step-by-step procedure for
solving a problem or accomplishing some end, especially by a computer.
Specifically, the term "algorithm" refers to a search algorithm used to locate specific
data from a genetic database.

As used herein, the term "amplification reaction" refers to a reaction causing
an increase in the number of copies of a specific DNA fragment, such as the
polymerase chain reaction (PCR).

The polypeptide of the present invention is preferably in an isolated form. As
used herein, the term "isolated polypeptide" refers to a polypeptide removed from its
native environment. Thus, a polypeptide produced and contained within a
recombinant host cell would be considered "isolated" for the purposes of the present
invention. Also intended as an "isolated polypeptide" are polypeptides that have been
purified, partially or substantially, from a recombinant host. Similarly, by "isolated
nucleic acid" or "isolated polynucleotide" is meant a nucleic acid sequence, which is
purified from other nucleic acid and protein contaminants.

As used herein, the term "NrProtein database" refers to the non-redundant
protein database, one of the databases available for searching using the BLAST
algorithm.

The present invention is directed to methods of identifying new genes in the
genome of an organism. The method comprises the steps of removing all annotated
ORFs and long genes from the organism's genome and then isolating small ORFs
(smORFs) of preferably less than 100 amino acids. These smORFs have at least a
20% sequence identity to all known sequences from related organisms, determined by
searching a database using a search algorithm. The methods may further comprise the
steps of identifying the smORFs that are coding ORFs and verifying, using molecular
genetics tools, that the smORFs are transcribed into RNA.

The present invention is also directed to 119 novel ORFs (SEQ ID NOS: 1-119)
and their corresponding proteins (SEQ ID NOS: 674-792) from the S. cerevisiae
genome, which were identified through the methods of the present invention as set
forth in Table 2. The present invention is also directed to 554 other ORF sequences
(SEQ ID NOS: 120-673) and their corresponding proteins (SEQ ID NOS: 793-1346)
identified in S. cerevisiae using the disclosed in silico method (see Table 2).

II. Identification of Novel Coding Sequences 

This invention relates to methods of identifying novel coding sequences in an
organism, for example, S. cerevisiae, as well as in other prokaryotic and eukaryotic
organisms. The methods of the present invention would be appropriate for use on the
genome of any organism, including, but not limited to, plants (e.g., rice, maize,
Arabidopsis), the plant pathogen Phytophthora, invertebrates (e.g., nematodes, higher
worms, fruit flies, etc.), fish (e.g., zebrafish), mammals (e.g., mice, humans, etc.) and
any of the other organisms discussed herein.

One method of identifying new genes in the genome of an organism comprises
the steps of removing annotated ORFs and long genes, preferably all known
sequences, from the organism's genome, and then isolating small ORFs (smORFs)
comprising nucleic acid and amino acid sequences, preferably predicted amino acid
sequences having at least a 20% sequence identity to all known sequences, more
preferably amino acid sequences from related organisms, wherein percent identity is
determined using an algorithm with parameter settings consisting essentially of or
equivalent to a p-value of less than 1 used in conjunction with a BLAST algorithm to
search a database of genetic information.

Preferably, the methods of the present invention are especially adaptable for 
whole fungal genomes. More preferably, the fungus is yeast. Most preferably, the 
yeast is S. cerevisiae or C. albicans. Accordingly, one embodiment of the present 
invention is a method of identifying new genes in the genome of S. cerevisiae 
comprising the steps of removing all annotated ORFs and long genes from the S. 
cerevisiae genome, and then isolating small ORFs (smORFs) comprising predicted 
amino acid sequences having at least a 20% sequence identity to all known fungal 
amino acid sequences, wherein percent identity is determined using an algorithm. For 
example, if the algorithm is BLAST the parameters comprise a p-value of less than 1. 
Other algorithms contemplated would use parameters producing similar results as 
would be known to the artisan of ordinary skill. 
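
The selection criteria described above (at least 20% sequence identity and a p-value of less than 1) can be sketched as a simple filter over search results. This is a minimal illustration only: the `Hit` record, its field names, and `select_candidate_orfs` are hypothetical and do not correspond to the output format of BLAST or any other program.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical record for one database match; not a real BLAST format."""
    query_id: str            # ORF from the organism being studied
    subject_id: str          # matching sequence in the comparison database
    percent_identity: float
    p_value: float

def select_candidate_orfs(hits, min_identity=20.0, max_p_value=1.0):
    """Return sorted ids of query ORFs with at least one qualifying hit.

    Thresholds follow the text: identity of at least 20% and p-value below 1.
    """
    return sorted({h.query_id for h in hits
                   if h.percent_identity >= min_identity
                   and h.p_value < max_p_value})
```

Keeping the thresholds as parameters makes it straightforward to apply the stricter 25% or 30% identity levels without changing the filter itself.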

A comparison of the yeast S. cerevisiae ORFs with a comprehensive fungal
database (excluding S. cerevisiae) suggests that most budding yeast ORFs have
homologs in other fungi. This led to the conceptualization and validation of a new
process for identifying novel coding sequences. For example, this would include the
following steps:

1. Take one nucleic acid genome of an organism to probe (e.g., 
S. cerevisiae). 

2. Collect known nucleic acid sequences (e.g., genes) of the 
genome from step 1.

3. Optionally remove known genes. 

4. Optionally take the portions of genome remaining after the 
above steps (known or otherwise, but not known to contain genes, e.g., 
intergenic regions). 

5. Take either intergenic region or whole genome. 

6. Identify all open reading frames (ORFs) of preferably about 
17 amino acids or longer stop-to-stop. 

7. Perform a six-frame translation (three frames forward, and 
three frames backward to correspond to the complementary strand). 

8. Look for stop codons (*). Start counting residues right after
the stop codon to the next stop codon. Take all the sequences that are
preferably 17 amino acids or longer and call it an ORF (stop-to-stop).
Typically, most programs identify sequences of at least 50 to 60 amino
acids or longer.

9. The novel step is then to construct a comprehensive
database containing genomic DNA and cDNA sequences from as many
organisms related to the subject as possible. For example, if the
subject organism is S. cerevisiae, the database would include genomic
and EST sequences from as many fungal species (excluding S.
cerevisiae) as available in the public and/or private databases,
including C. albicans, Aspergillus nidulans, A. fumigatus,
Schizosaccharomyces pombe, Neurospora crassa, Cryptococcus
neoformans, Fusarium sporotrichioides, etc.

10. The ORFs identified in steps 7 and 8 are then compared
against a six-frame translation of the nucleotide sequences contained in
the database described in step 9. For example, if the organism being
studied is S. cerevisiae, then the ORFs identified in step 6 are
compared against the nucleotide sequences in the fungal database.
Preferably, a comparison algorithm, such as TBLASTX, is used. In the
instance of TBLASTX, the parameters preferably include a p-value of
less than 1. Comparable algorithms with comparable parameters can
also be utilized.

1 1 . Compare the amino acid sequences using sequence identity 
20 parameters. 

12. Collect all the hits against entries in the database (e.g.,
fungi).

13. A hit indicates that the ORF being studied from the
first organism (e.g., S. cerevisiae) is likely to be a coding ORF (i.e.,
a smORF), because it has predicted homologs in the organisms
contained in the database (e.g., fungal database).
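
The six-frame translation and stop-to-stop ORF extraction of steps 6 through 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the analysis described here: it assumes the standard genetic code and plain A/C/G/T input, keeps only segments bounded by a stop codon on both sides, and uses the 17-residue minimum of step 8 as a default. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# nuclear standard code; other codes would need a different table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Three forward frames and three frames on the complementary strand."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def stop_to_stop_orfs(dna, min_len=17):
    """Collect (frame, peptide) pairs lying between two stop codons."""
    orfs = []
    for frame, protein in six_frame_translations(dna).items():
        # Drop the first and last pieces: they are not flanked by two stops.
        for segment in protein.split("*")[1:-1]:
            if len(segment) >= min_len:
                orfs.append((frame, segment))
    return orfs
```

Raising `min_len` to 50 or 60 would reproduce the more conventional cutoff mentioned in step 8, at the cost of discarding the small ORFs this method is designed to retain.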

A. Compilation of Organism Genome and Removal of Annotated ORFs 

For an ORF to be considered a good candidate for coding a cellular
protein, a minimum size requirement is often set. This is not the case here. One
novel characteristic of the present invention is that the small ORFs, which are often
discounted in genome analysis, are considered here.

The first step in the methods of the present invention is an examination of the
entire genome of the organism of choice, as outlined in Fig. 1. The sequences of the
genome of choice may be found anywhere, including, but not limited to, GenBank™,
EST sequence databases, Celera's recent human genome database (Venter et al., "The
Sequence of the Human Genome," Science 291: 1304-51 (2001)), and other organism
genome databases as they are elucidated. For example, the entire S. cerevisiae
genomic sequence (12.07 Mb total) was obtained from the
Saccharomyces Genome Database as of December 5, 1997, and examined. (See
http://genome-www.stanford.edu/Saccharomyces/).

B. The Isolation of smORFs Using Bioinformatics

The next step in the method of the claimed invention is the isolation of
smORFs, by running the remaining ORFs obtained in the above steps against a
database of known genes to identify any potential homologs. The database can be any
searchable database in which homologous sequences can be identified. Preferably the
sequences are compared using algorithms such as BLAST or FASTA or equivalent
algorithms.

Specifically, a method of identifying new genes in the genome of an organism
comprises the steps of removing all annotated ORFs and long genes from the
organism's genome. Alternatively, the removal of sequences does not need to occur.
This is followed by isolating small ORFs (smORFs) comprising nucleic acid and
amino acid sequences having at least a 20% sequence identity to all known
sequences from related organisms. Preferably, the comparison is of amino acid
sequences.

The smORFs may have a sequence identity to all known sequences from
related organisms of about 20% or more. Preferably, the sequence identity is at least
about 25%, and more preferably at least about 30%.
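
As an illustration of the identity thresholds above, the sketch below computes percent identity over the columns of a pairwise alignment and reports which of the stated thresholds (about 20%, 25% or 30%) it meets. Counting gap columns in the denominator is an assumption, chosen because BLAST reports identities over the full alignment length; the example sequences and function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues.

    Both inputs are rows of one alignment ('-' marks a gap), so they must
    have equal length. Gap columns count toward the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def identity_tier(pct):
    """Map a percent identity onto the thresholds named in the text."""
    if pct >= 30.0:
        return "at least about 30%"
    if pct >= 25.0:
        return "at least about 25%"
    if pct >= 20.0:
        return "at least about 20%"
    return "below 20%"


# Hypothetical toy alignment: 4 identical columns out of 6.
example = percent_identity("MKT-LV", "MRTALV")
```

A real comparison would take its aligned rows from BLAST or FASTA output rather than from hand-written strings.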

The first organism database searched and compared to another organism may
comprise a plurality of known genomic nucleotide sequences and expressed sequence
tags (ESTs). For example, the nucleic acids encoding the polypeptide sequences of the
present invention are analyzed using BLAST against any type of sequence from a
similar organism, including, but not limited to, nucleotide sequences, protein
sequences, peptide sequences and ESTs.

In this step, the database should be a database of nucleotide sequences from a
species related to the organism of choice. For example, the genome of the yeast S.
cerevisiae was searched against a database of all known fungal sequences.
Alternatively, the database may be a database of all eukaryotic nucleotide sequences.
Specifically, the organism source of the eukaryotic nucleotide sequences may include,
but is not limited to, primate, equine, bovine, caprine, ovine, porcine, feline, canine,
lupine, camelid, cervidae, rodent, avian and ichthyes. If a primate database is
searched, the primate is preferably human.

The long genes removed from the genome are all genes of about 100 or more
amino acids. The small ORFs (smORFs), the preferred sequences of interest in the
present invention, are sequences of typically less than 100 amino acids. However, the
methods of the invention can be used to identify ORFs that encode polypeptides
greater than 100 amino acids. One of the novel features of the instant invention is the
focus on ORFs that are small and therefore previously excluded or not rigorously
studied by researchers.

For example, in the present invention, the S. cerevisiae genome was analyzed
and the nucleotide sequences of the previously identified 6,224 coding ORFs were
removed. Next, the remaining sequences (3.45 Mb) were analyzed to identify all
stop-to-stop ORFs using a size of preferably about 17 or 18 residues or longer, based on the
fact that in E. coli, the overwhelming majority of genes code for proteins of
about 17 or 18 amino acids or longer (E. coli Genome Center, October 13, 1998,
revision date, University of Wisconsin, Madison, http://www.genetics.wisc.edu/).
This analysis produced approximately 140,000 ORFs, most of them shorter than 100
residues.

In isolating smORFs of an organism's genome, a microarray may be used.
In one embodiment of the present invention, the ORFs thus identified were searched
against a comprehensive fungal sequence database to identify any ORFs with
potential homologs. This fungal database consisted of all NCBI entries listed under
"fungi" (August 20, 2000, excluding any S. cerevisiae sequences), plus the genomic
sequences from Candida albicans (Stanford University) and Aspergillus fumigatus
(PathoGenome™ database; A. fumigatus genomic sequences are available at
http://www.LabOnWeb.com), EST sequences from Aspergillus nidulans,
Cryptococcus neoformans, Fusarium sporotrichioides, and Neurospora crassa
(University of Oklahoma Health Sciences Center), and Pneumocystis carinii EST
sequences (University of Georgia). Using a cutoff score of p ≤ 10^-4 (a score of p ≤ 10^-4
was chosen, since it is reasonably stringent for small ORFs), 1057 S. cerevisiae ORFs
were identified with potential homologs in the fungal database. Preferably the p value
when using BLAST is a value less than 1. After removing smORFs overlapping with
rRNA, tRNA and retrotransposon elements (i.e., Ty elements), 673 smORFs were
obtained (SEQ ID NOS: 1-673). Since homologs of these budding yeast ORFs were
found in at least one other fungal species, it seems reasonable to predict that most of
these 673 ORFs (SEQ ID NOS: 1-673) are coding ORFs (Fig. 1), as
further described in Table 2.
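
The two filtering stages just described (keep ORFs whose best database hit passes the p-value cutoff, then discard those overlapping rRNA, tRNA or retrotransposon features) can be sketched as follows. This is a hedged illustration only: the record layout, the names `OrfHit`, `Interval` and `select_candidates`, the inclusive comparison against a cutoff of 10^-4, and the example coordinates are all assumptions made for the example, not details taken from the analysis itself.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass(frozen=True)
class OrfHit:
    name: str
    location: Interval
    best_p: Optional[float]  # best p-value against the database, None if no hit


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def select_candidates(
    orfs: Iterable[OrfHit],
    masked: Iterable[Interval],
    p_cutoff: float = 1e-4,
) -> List[str]:
    """Names of ORFs with a qualifying hit that avoid all masked features."""
    masked = list(masked)
    kept = []
    for orf in orfs:
        if orf.best_p is None or orf.best_p > p_cutoff:
            continue  # no homolog, or hit too weak
        if any(overlaps(orf.location, m) for m in masked):
            continue  # overlaps an rRNA, tRNA or Ty element
        kept.append(orf.name)
    return kept


# Hypothetical data: four ORFs and one masked Ty element.
example_orfs = [
    OrfHit("orfA", Interval("chrI", 100, 180), 2e-10),
    OrfHit("orfB", Interval("chrI", 500, 560), 0.3),
    OrfHit("orfC", Interval("chrII", 40, 100), 1e-6),
    OrfHit("orfD", Interval("chrII", 900, 960), None),
]
example_masked = [Interval("chrII", 90, 400)]
candidates = select_candidates(example_orfs, example_masked)
```

In this toy input only `orfA` survives: `orfB` fails the cutoff, `orfC` overlaps the masked element, and `orfD` has no hit.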

Table 2 describes the function of the genes and proteins of the present
invention. The first column contains the smORF designation number. The nucleotide
and amino acid sequences designated by their SEQ ID NOS are contained in the
second and third columns. The corresponding lengths of the nucleotide and amino acid
sequences are listed in the fourth and fifth columns, respectively. BLAST scores and
probabilities from the analysis described herein are provided in the sixth and seventh
columns, respectively. The description of the gene and protein is contained in the
eighth column. The description field provides, where available, the accession number
(AC) or SwissProt accession number (SP), the locus name (LN), Superfamily
classification (CL), the organism (OR), the source of variant (SR), the E.C. number
(EC), the gene name (GN), the product name (FN), the function description (FN), the
map position (MP), left end (LE), right end (RE), coding direction (DI), the database
from which the sequence originates (DB), and the description (DE) or notes (NT) for
each ORF.

C. Validation of the Novel Coding Sequences 

Finally, the smORFs identified using the methods of the present invention
may be validated as coding sequences that are transcribed into RNA by the use of known
experimental techniques such as reverse transcriptase-polymerase chain reaction
(RT-PCR). A subset (i.e., 154) of the 673 smORFs (SEQ ID NOS: 1-673) was chosen
for analysis by RT-PCR. RT-PCR analysis showed that a transcript could be
demonstrated for 119 smORFs (SEQ ID NOS: 1-119). With regard to any smORFs
identified and validated through the methods described above, the present invention
further relates to a vector comprising such a smORF, a cell comprising the vector, a
polypeptide encoded by the smORF and a nucleic acid which hybridizes to the sense
or antisense strand of a smORF identified using the methods of the present invention,
preferably under stringent conditions.
Stringency is a term used in hybridization experiments to denote the degree of
homology between the probe and the filter-bound nucleic acid; the higher the
stringency, the higher the percent homology between the probe and filter-bound nucleic
acid. If the stringency is too low, nonspecific hybridization may occur. If the
stringency is too high, only a weak signal or no signal may be observed. For any
hybridization, stringency can be varied by manipulation of three factors: temperature,
salt concentration, and formamide concentration; however, stringent conditions are
sequence-dependent and will differ depending on the circumstances. For example,
longer sequences hybridize specifically at higher temperatures. Generally, highly
stringent conditions are selected to be about 5-10°C lower than the thermal melting
point (Tm) for the specific sequence at a defined ionic strength and pH. Low stringency
conditions are generally selected to be about 15-30°C below the Tm. The Tm is the
temperature at which 50% of the probes complementary to the target hybridize to the
target sequence at equilibrium. Stringent conditions will be those in which the salt
concentration is less than about 1.0 M sodium ion, typically about 0.01 to 1.0 M
sodium ion concentration (or other salts) at pH 7.0 to 8.3, and the temperature is at
least about 30°C for short probes (e.g., about 10 to about 50 nucleotides) and at least
about 60°C for long probes (e.g., greater than about 50 nucleotides). Stringent
conditions may also be achieved with the addition of destabilizing agents such as
formamide.

The degree of hybridization may also depend on the amount of identity between
the sequences. Preferably the region of identity is greater than about 5 bp, more
preferably the region of identity is greater than 10 bp.

Stringent hybridization conditions are known in the art and include, but are
not limited to: (a) washing with 0.1X SSPE (0.62 M NaCl, 0.06 M NaH2PO4·H2O,
0.075 M EDTA, pH 7.4) and 0.1% sodium dodecyl sulfate (SDS) at 50°C; (b)
washing with 50% formamide, 5X SSC (0.75 M NaCl, 0.075 M sodium citrate), 50
mM sodium phosphate (pH 6-8), 0.1% sodium pyrophosphate, 5X Denhardt's
solution, sonicated salmon sperm DNA (50 µg/ml), 0.1% SDS and 10% dextran
sulfate at 42°C, followed by washing at 42°C in 0.2X SSC and 0.1% SDS; and (c)
washing with 0.5 M NaPO4, 7% SDS at 65°C followed by washing at 60°C in 0.5X
SSC and 0.1% SDS. High stringency hybridization conditions are those performed at
about 20°C below the melting temperature (Tm). Preferred stringency is performed at
about 5-10°C below the melting temperature (Tm). Additional hybridization
conditions can be prepared as found in chapter 11 of Sambrook et al. (1989)
Molecular Cloning: A Laboratory Manual, 2d Ed., Cold Spring Harbor Laboratory
Press, or as would be known to the artisan of ordinary skill.
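
The temperature ranges stated above (about 5-10°C below the Tm for highly stringent conditions and about 15-30°C below the Tm for low stringency) amount to simple arithmetic once a Tm is known. The sketch below only restates that arithmetic; it does not estimate the Tm, which must be supplied by the reader, and the function name and the 70°C example value are hypothetical.

```python
# Offsets below Tm, in degrees Celsius, as (nearest, farthest) from the Tm.
STRINGENCY_OFFSETS = {
    "high": (5.0, 10.0),   # about 5-10 C below Tm
    "low": (15.0, 30.0),   # about 15-30 C below Tm
}


def hybridization_window(tm_celsius, stringency="high"):
    """Return (lowest, highest) temperature for the requested stringency."""
    if stringency not in STRINGENCY_OFFSETS:
        raise ValueError(f"unknown stringency: {stringency!r}")
    nearest, farthest = STRINGENCY_OFFSETS[stringency]
    return (tm_celsius - farthest, tm_celsius - nearest)


# Hypothetical probe with a Tm of 70 C.
high_window = hybridization_window(70.0, "high")
low_window = hybridization_window(70.0, "low")
```

Because the Tm itself depends on sequence, ionic strength and formamide concentration, the window shifts whenever any of those factors changes.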

Extensive guides to the hybridization of nucleic acids and sequence identity
can be found in Sambrook et al. (1992) Molecular Cloning: A Laboratory Manual,
2d Ed., Cold Spring Harbor Laboratory Press, and Ausubel et al. (1995) Current
Protocols in Molecular Biology, Greene Publishing Co., NY.

We have developed and validated a novel method for gene identification in 
sequenced genomes and used it to identify new genes in S. cerevisiae. With this 
method, one should be able to find new coding ORFs in S. cerevisiae or other yeasts 
by simply searching potential budding yeast ORFs against other fungal species. Even 
though our experimental design was purposely non-exhaustive to demonstrate the 
proof of principle and the validity of this gene discovery process, we found strong 
evidence for several hundred new genes in the S. cerevisiae genome. For the three 
new genes selected for detailed analysis and experimental studies, we identified 
orthologs in other fungal species, as well as in other eukaryotes (e.g., mammals). 
This example can be expanded to include smORFs that partially overlap with 
annotated ORFs and smORFs that are completely located within previously annotated 
ORFs. The identification of conserved genes across a wide range of species provides 
the opportunity to use S. cerevisiae and/or other fungi to study the function of their 
counterparts in humans. In addition, the disclosed methods can be applied to other 
sequenced genomes, including humans, in order to identify coding ORFs not 
previously detected using conventional methods. This novel genome comparison 
approach to identify new ORFs will accelerate genome annotation and gene 
identification.

III. Novel smORF Sequences Identified

To establish a proof of principle and verify this new method, a case study was
done using the budding yeast genome, because it is one of the most exhaustively
studied biological systems. Consequently, analysis of this genome to identify new
genes not previously described is a rigorous test of the system, challenging the
present methods used to identify new genes.

The new smORFs identified using the methods described herein were then
subjected to a validation step. A comprehensive analysis of three smORFs
(smORF18, smORF139 and smORF570) was
performed as a means of verifying their ability to encode a polypeptide. Most of the
analysis was done with the Compas™ package (Genome Therapeutics Corporation),
which performs a database search, as well as identification of such structural elements
as motif, protein family (pfam), helix-turn-helix, coiled-coil and signal peptide, to
name a few; Compas™ also identifies protein secondary structure and predicts
cellular location. We identified a wide range of homologs in other species for all
three smORFs. SmORF18 and smORF570 have homologs in fungi and mammals
(Fig. 3). SmORF18 also has plant homologs. Homologs of smORF139 have been found
only in fungi so far (Fig. 3). SmORF18 seems to be part of a larger protein in
Arabidopsis thaliana, Sorghum bicolor, Oryza sativa, Glycine max and other plants,
but the orthologs in human, Caenorhabditis elegans, Drosophila melanogaster, and
Schizosaccharomyces pombe are about the same length as the S. cerevisiae smORF.

While the patches of highly conserved residues in the homologs for the three
smORFs strongly suggest that these ORFs encode proteins, the definitive proof came
from experimental work, wherein molecular genetics tools were used to confirm that
these smORFs are transcribed into RNA. Primers were designed to amplify the three smORFs
as well as the ACT1 gene (actin) control. The primers were chosen to give a PCR
amplification product of 250 to 300 base pairs that lies inside the ORFs. Examples of
primers for the ACT1 gene and three smORFs are shown in Table 1. These primers
were used for PCR amplification of S. cerevisiae genomic DNA (template) to test the
PCR amplification conditions (yeast genomic DNA was prepared from strain W303
using the Yeastar Genomic DNA kit (Zymo Research) as suggested by the
manufacturer).

Table 1

smORF             Primer Sequence                         SEQ ID NO

smORF18           5'-TGACGAAATCGAAATCGAAG-3'
                  5'-GATGCCTGCCTCTTCGTAGT-3'

smORF139          5'-TGCCTAAGAGATTAAGTGGGTT-3'
                  5'-CGTCAGTTCAGGGTGTGAAA-3'

smORF570          5'-TGTCTGCATTATTTAATTTTCGTTC-3'
                  5'-AGCTGTTAAATTGACTGATGGC-3'

yeast ACT1 gene   5'-TGTCACCAACTGGGACGATA-3'
                  5'-AACCAGCGTAAATTGGAACG-3'
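
A primer pair such as those in Table 1 can be sanity-checked computationally before use. The sketch below is an illustration under stated assumptions rather than the procedure used here: it reports primer length and GC content and, given a template, the size of the product the pair would bracket, which can then be compared with the 250 to 300 base pair design target. The helper names and the synthetic template are hypothetical; the two primer strings are the ACT1 primers as read from Table 1.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    gc = sum(1 for base in primer if base in "GC")
    return 100.0 * gc / len(primer)


def amplicon_length(template, forward, reverse):
    """Length of the product bracketed by the primers, or None if not found.

    The forward primer must match the template strand as given and the
    reverse primer must match its complementary strand downstream.
    """
    start = template.find(forward)
    reverse_site = template.find(reverse_complement(reverse))
    if start == -1 or reverse_site == -1 or reverse_site < start:
        return None
    return reverse_site + len(reverse) - start


ACT1_FORWARD = "TGTCACCAACTGGGACGATA"
ACT1_REVERSE = "AACCAGCGTAAATTGGAACG"

# Synthetic template (hypothetical): both primer sites separated by filler.
toy_template = (
    "GG" + ACT1_FORWARD + "A" * 210 + reverse_complement(ACT1_REVERSE) + "CC"
)
toy_product = amplicon_length(toy_template, ACT1_FORWARD, ACT1_REVERSE)
```

Exact string matching is a simplification; a real check would also tolerate mismatches and look for secondary binding sites.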





Products of the predicted size were obtained for all three smORFs, as well as
the actin control (Fig. 2A, lanes 2, 6, 10, and 14). No PCR products were obtained in
reactions without template (Fig. 2A, lanes 1, 5, 9, and 13), or using RNA isolated
from S. cerevisiae grown on rich media (YEPD) or complete synthetic minimal
(CSM) media (Fig. 2A, lanes 3, 4, 7, 8, 11, 12, 15, and 16). This indicates that these
RNA samples were not contaminated with genomic DNA. (RNA was isolated from
5 × 10^7 yeast (strain W303) cells growing exponentially in YEPD or synthetic complete
minimal media using the RNeasy™ Mini kit from Qiagen, including a DNase (Roche)
digestion step.) We then tested for the presence of RNA transcripts originating from
these smORFs, as well as from the actin control, using RT-PCR (RT-PCR reactions
were done with the OneStep RT-PCR Kit from Qiagen as recommended by the
manufacturer). Products of the expected sizes were obtained for actin, as well as all
three smORFs (Fig. 2B, lanes 2, 3, 5, 6, 8, 9, 11, and 12). This indicates that actin
and the three smORFs are indeed expressed in yeast cells grown in both rich and
minimal media. No RT-PCR product was obtained in reactions without template
(negative control) (Fig. 2B, lanes 1, 4, 7, and 10). The identity of the RT-PCR
products was confirmed by cloning. The RT-PCR products were isolated from an
agarose gel and then cloned into pCR2.1-TOPO (Invitrogen), as recommended by the
manufacturer. The clones were then restriction mapped and dideoxy sequenced.

To determine whether the identified smORFs were indeed transcribed from
the predicted DNA strands, a modified RT-PCR experiment was performed. First, a
primer complementary to the predicted mRNA and the reverse transcriptase were
added. After first strand cDNA synthesis, the reverse transcriptase was inactivated
with heat. Taq polymerase and both smORF-specific primers were then added (Fig.
2C). Under these conditions, PCR products were observed only when first strand
synthesis was conducted with primers complementary to the predicted mRNA (lanes
5, 6, 11, 12, 17 and 18). No PCR product was observed if first strand synthesis was
done with primers that have the same sequence as the mRNA (lanes 3, 4, 9, 10, 15
and 16). These results indicate that the transcripts observed for smORFs 18, 139 and
570 (SEQ ID NOS: 4, 36 and 96) are made from the predicted strand. This same
study was extended to 151 additional smORFs, most of which have a potential
homolog in the genome of C. albicans. The results show that an RT-PCR product of
the expected size was obtained for 116 of these smORFs (Figs. 2D and 2E).
Therefore, 119 of the 154 smORFs are transcribed from the predicted DNA strand
(Table 2). See SEQ ID NOS: 1-119.

To address the possibility that the observed smORF transcripts were products
of read-through transcription from genes located upstream from the smORFs, the
RT-PCR experiment was conducted using a primer complementary to the mRNA for first
strand synthesis (Fig. 2C) and with a second primer located 400 base pairs upstream
of the smORF. Under these conditions, no RT-PCR product was observed,
demonstrating that the smORF transcripts were not the result of read-through
transcription from upstream genes.

Functional analysis can then be performed. For example, site-directed
mutagenesis can be performed to disrupt the function of each gene and examine the
resulting phenotypic changes, as would be known to the artisan of ordinary skill. The
three smORFs described here do not overlap with previously annotated ORFs, and a
start-to-stop ORF can be clearly defined for each. These three ORFs are not duplicated
in the budding yeast genome, as only one copy of each ORF was identified in the
genome. Additionally, these S. cerevisiae smORFs have highly conserved homologs
in other fungal species (50 to 60% amino acid identity and 70 to 80% similarity). In
the case of smORFs 18 and 570 (SEQ ID NOS: 677 and 769, respectively), highly
conserved homologs could also be found in mammalian genes.

The yeast smORFs identified using the methods described herein are 
described more fully below. 

(i) Yeast smORF570. Comprehensive bioinformatics analysis of the yeast
smORF570 protein sequence (SEQ ID NO: 769) suggests that this protein functions
as a secreted protein. Using SigCleave (eGCG version 8), we have identified three
overlapping signals with scores of 11.6, 6.4 and 5.1, in a region that extends from
amino acid 9 through amino acid 29, with a predicted cleavage site in the region of
amino acids 22-27. Although TopPredII suggests the presence of two transmembrane
domains with moderate certainty, the initial domain identified overlaps the
SignalPeptide prediction noted earlier and likely represents the hydrophobicity
associated with the SignalPeptide region. Given the presence of three conserved
cysteine residues within the protein, which are likely to represent sites of inter- or
intra-protein cross-linking, the second site identified by TopPredII is subthreshold
(below a certainty cut-off of 1.5) and is more consistent with hydrophobicity that
drives protein folding rather than a membrane-spanning region. Taking these data
together, our analysis would support the function of smORF570 as a secreted protein
that could act as a ligand, a soluble receptor or a binding protein. Based on this
information, smORF570 would also be a target for antifungal agents and other
therapeutics described herein.

The human homolog of smORF570 maps to chromosome 19 (19q13.1), in a
region with multiple olfactory receptors (AC005255, between OLFR and MEL),
though the gene itself was not identified. The human smORF570 protein is 74%
identical to its D. melanogaster homolog (AE003512), 39% identical to its C. elegans
counterpart, and 40% identical to a novel gene expressed in human adrenal gland
(AF164793). EST hits for the human smORF570 homolog were found with bovine
placenta, pig spleen lambda, mouse irradiated colon, and embryonal carcinoma cell
line F9. Based on this information, the human homolog is most likely involved in
cancer and could serve as a therapeutic target.

(ii) Yeast smORF18. Of particular note is the sequence conservation (31%)
shared with the N-terminus of a chicken fas ligand receptor, soluble form
(AF296875, 285 amino acids, p = 0.84). The number and spacing of Cys residues are
also similar in the aligned portion of the two proteins. EST hits were found in mouse
placenta, Beddington mouse dissected endoderm, rat kidney, rat embryo, and human
placenta.

The conservation of residues across fungi suggests that smORF18 could be
used as an antifungal target using the methods described herein. The identities
between the human smORF18 homolog and its counterparts in D. melanogaster, C.
elegans and A. thaliana are 70%, 69% and 60%, respectively, at the amino acid residue level.
The smORF18 protein is also 31% identical to the Schizosaccharomyces pombe DnaJ
heat-shock protein (316 amino acids).

To further demonstrate the validity of the method, a comprehensive analysis
of smORF18 was conducted. A wide range of homologs was identified in other
species (Fig. 3). SmORF18 seems to be part of a larger protein in Arabidopsis
thaliana, Sorghum bicolor, Oryza sativa, Glycine max and other plants. The human,
Caenorhabditis elegans, Drosophila melanogaster and Schizosaccharomyces pombe
smORF18 homologs are about the same size as the S. cerevisiae smORF18 (SEQ ID
NO: 677). SmORF18 (SEQ ID NO: 4) was recently annotated by Blandin et al.
(FEBS Lett. 487: 31, 2000) and assigned the systematic name YBL071W-A.

Study of smORF18 (SEQ ID NO: 4) was extended to determine whether a
protein product of the appropriate size could be detected. A triple HA-tag was fused
to the C-terminus of smORF18 (SEQ ID NO: 4) by PCR. First, a PCR amplification
was made using a primer corresponding to 400 bp upstream of smORF18 (L) and a
second primer containing the C-terminus of smORF18 fused to the HA-tag (5'-
GGAGCCTGATCCAGCGTAGTCTGGGACGTCGTATGGGTAGCCAGCGTAGT
CTGGGACGTCGTATGGGTAGCCAGCGTAATCCGGAACATCATACGGGTAT
CCTACGGCAGCAGCGGCAATAGGCTCAGG-3') (SEQ ID NO: ). A
second amplification was carried out with a forward primer containing the tag (5'-
GTAGGATACCCGTATGATGTTCCGGATTACGCTGGCTACCCATA
CGACGTCCCAGACTACGCTGGCTACCCATACGACGTCCCAGACTACGCTG
GATCAGGCTCCTAAAGATGAGAGGCTAGATCGAG-3') (SEQ ID NO: )
and a primer located downstream of smORF18 (5'-TGTCGCTTTTTCTCCTCGATG
AAGCCAAGCGCCGAACCAATTGATATCATCGGCACG-3') (SEQ ID NO: _).

The wild-type smORF18 gene was replaced with the tagged version by allele
replacement into the chromosome (Erdeniz et al., 1997, Genome Res. 7: 1174). PCR
amplification of the smORF18(HA)3 gene from genomic DNA followed by cloning
and sequencing confirmed the identity of the tagged smORF18. For sequencing, PCR
products were isolated from an agarose gel and then cloned into pCR2.1-TOPO
(Invitrogen). Soluble S100 extracts were prepared from diploid W303 (B.J. Thomas
et al., 1989, Genetics 123: 725) and from HA-tagged yeast cells grown in 25 ml of
rich medium (YPD) to mid-log phase as described (Brown et al., 1996, Mol. Cell.
Biol. 16: 5744). Soluble extracts were then fractionated in 18% polyacrylamide gels
containing SDS. The proteins were then transferred to a PVDF membrane and the
blot probed with anti-HA antibodies. The results show a protein band corresponding
to a 9 kDa protein (Fig. 4, lanes 3 and 4) in extracts prepared from cells with a tagged
smORF18 gene and not in wild-type cells. This result demonstrates that smORF18
(SEQ ID NO: 4) is not only transcribed, but also encodes a detectable protein product
of the predicted size.

A next step in the identification and characterization of the gene is
to test whether the smORF is essential. For example, one copy of the complete
smORF18 gene was deleted in a diploid yeast strain by homologous recombination.
Cells were transformed with a PCR fragment containing the HIS3 marker flanked by
400 bp of smORF18 sequences. The HIS3 sequence replaced amino acids 1 to 82 of
smORF18. Histidine prototrophs were selected and PCR was used to verify correct
genomic integration. Sporulation and tetrad analysis showed that haploid strains with
a smorf18Δ deletion were able to grow at 30°C (slow growth), but not at 37°C (Fig. 5). We
next tested whether the human smORF18 is a functional homolog of the yeast smORF18.
The human smORF18 gene, which was obtained from an EST clone, and the yeast
smORF18 were cloned into a pYES (Invitrogen) vector for expression in yeast under
the GAL1 promoter. The human smORF18 coding sequence was amplified from
I.M.A.G.E. clone 1047404 (Research Genetics, Inc.). The yeast smORF18 was
amplified from genomic DNA. PCR fragments were cloned into pYES2.1/V5-His-TOPO
(Invitrogen). Clones were verified by sequencing and transformed into the
smorf18Δ strain. The resultant transformants were tested for the ability to
complement the temperature-sensitive phenotype of the smorf18Δ strain. The results
demonstrate that the cloned human smORF18 as well as the yeast smORF18 (SEQ ID
NO: 4) can complement the temperature-sensitive phenotype of the smorf18Δ strain
(Fig. 5). These results indicate that the human smORF18 is a functional ortholog of
yeast smORF18 (SEQ ID NO: 4). The human smORF18 maps to two loci in the
human genome: one on chromosome 3, where the gene contains two introns and codes
for a predicted mRNA identical to the EST, and one on chromosome 20 (i.e.,
20q13.2-13.33, AL035669), without introns but with nine predicted amino acid
substitutions. These data indicate that small ORFs are present and expressed in
humans and underscore the importance of looking for small genes in the genomes of
higher eukaryotes. smORF18 is essential for growth of yeast at 37°C and has
conserved homologs in organisms from yeast to man. smORF18 was used as bait in
the two-hybrid analysis to isolate interactors. This gene is essential in yeast.

(iii) Yeast smORF139 (SEQ ID NO: 36). The smORF139 protein (SEQ ID
NO: 709) appears to be a conserved protein in fungi. However, the conserved
sequence, "LSGLQK", is shared with lamin B2 from Xenopus laevis, chicken and
human. The S. cerevisiae smORF139 protein is also 35% identical to an unidentified
protein (AC003000) from Arabidopsis thaliana chromosome II (see below), and 33%
identical to the middle section of glutathione transferase (S33628) from Dianthus
caryophyllus (clove pink). SigCleave (eGCG version 8) identified a weak signal
peptide (score 0.9) from residue 13 to 26. No transmembrane domain was found.
The A. fumigatus version has an intron in the gene. SmORF139 (SEQ ID NO: 709)
was found in the region of the ade2 gene for phosphoribosylaminoimidazole carboxylase
and pheromone response protein (RGA1) in Zygosaccharomyces rouxii. smORF139
(SEQ ID NO: 628) from S. cerevisiae is 74% identical to an unknown protein in
Zygosaccharomyces rouxii. S. cerevisiae smORF139 also has a hit (38% identity) to
a Medicago truncatula (plant) EST sequence (AW584424).

The smORF139 protein (SEQ ID NO: 709) is 35% identical to "Arabidopsis
thaliana protein fragment SEQ ID NO: 1495" disclosed by Ceres Inc. on 25-FEB-1999.
The smORF139 is, however, conserved among fungi and therefore could be
used as a target for antifungal compositions described herein.

(iv) Yeast smORF57. smORF57 (SEQ ID NO: 13) is conserved between S.
cerevisiae and C. albicans. The closest homolog in C. albicans is orf6.5842 and the
following is the alignment between the two sequences:

Score = 94 (38.1 bits), Expect = 2.2e-10, P = 2.2e-10
Identities = 23/89 (25%), Positives = 50/89 (56%)

Sc:  4 NLSPLQQEVLDKYKQLSLDLKALDETIKELNYSQHRQQHSQQETVSPDEILQEMRDIEVK 63
       NLSP++Q++L +Y+ ++ +L + ++ L + + ++ +++ +R +E K
Ca: 24 NLSPIEQKILQQYQLMNNNLIKVSNELELLTNTTDEFGKGKGSSIHLVENLRQLETK 80

Sc: 64 IGLVGTLLKGSVYSLILQRKQ--EQESLG 90
       +  V T  KG+VYS++  +    EQE+ G
Ca: 81 LVFVYTFFKGAVYSILNAQDYIAEQETNG 109



When smORF57 was used as bait, three proteins were found as interactors: Dad1p,
Dam1p, and Duo1p, which are part of a complex of proteins that function at the
kinetochore and are important for mitotic spindle integrity (Enquist-Newman
M. et al., 2001, Mol. Biol. Cell 12: 2601-2613). The interactions between
smORF57 and Dad1p, Dam1p, and Duo1p have been confirmed by directed testing in
the yeast two-hybrid system. Dam1p and Duo1p have homologs in C. albicans,
which are orf6.7374 and orf6.6397, respectively (Cheeseman I.M. et al., J. Cell. Biol.
152: 197-212). In addition, Dad1p has a homolog in C. albicans in Contig6-2505
(Enquist-Newman M. et al., 2001, Mol. Biol. Cell 12: 2601-2613). The C. albicans
genes coding for Dad1p, Dam1p, and Duo1p were also used in the yeast two-hybrid
system to analyze the interactions. A diagram indicating the confirmed interactions
between smORF57 and Dad1, Dam1, and Duo1 is shown in Figure 6. smORF57 also
interacted with Mlp1p, a non-essential protein (myosin-like protein 1) localized to the
nucleus close to the nuclear envelope, and with the gene product of the YLR287C gene,
which is a non-essential protein of unknown function.

The interaction of smORF57 with the Dad1/Dam1/Duo1 complex suggests that it
also is involved in kinetochore function and mitotic spindle integrity. Moreover, the
conservation of residues coupled with the lack of a human ortholog strongly suggests
that smORF57 would be a target for antifungal treatment and compositions described
herein. In addition, smORF57 could also be used in diagnosing fungal infections,
which is also provided by this invention.



smORFs 172 and 181 (SEQ ID NOS: 43 and 44, respectively). These
two smORFs also have homologs in C. albicans and the alignments
are shown below:

smORF172 (SEQ ID NO: 43):

Score = 339 (124.4 bits), Expect = 2.4e-30, P = 2.4e-30
Identities = 63/77 (81%), Positives = 69/77 (89%), Frame = -3

Query: 1     MDALNSKEQQEFQKVVEQKQMKDFMRLYSNLVERCFTDCVNDFTTSKLTNKEQTCIMKCS 60
             MD LN KEQQEFQ++VEQKQMKDFM LYSNLV RCF DCVNDFT++ LT+KE +CI KCS
Sbjct: 31134 MDQLNVKEQQEFQQIVEQKQMKDFMNLYSNLVSRCFDDCVNDFTSNSLTSKETSCIAKCS 30955

Query: 61    EKFLKHSERVGQRFQEQ 77
             EKFLKHSERVGQRFQEQ
Sbjct: 30954 EKFLKHSERVGQRFQEQ 30904
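
For illustration only, the identity figure reported for the smORF172 alignment can be recomputed from the aligned residues. The following Python sketch is not part of the disclosed method. The function name `alignment_identity` is hypothetical, and the query string is written with residues 14-16 as K-V-V, which is an assumption drawn from the midline ("++VEQ") and the 1-60 coordinate span of the first query line.

```python
def alignment_identity(query: str, subject: str) -> tuple:
    """Return (identical positions, alignment length) for two aligned strings.

    Both strings must already be aligned (same length, '-' for gaps).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query)


# Aligned residues transcribed from the smORF172 alignment (no gaps in either row).
QUERY = ("MDALNSKEQQEFQKVVEQKQMKDFMRLYSNLVERCFTDCVNDFTTSKLTNKEQTCIMKCS"
         "EKFLKHSERVGQRFQEQ")
SBJCT = ("MDQLNVKEQQEFQQIVEQKQMKDFMNLYSNLVSRCFDDCVNDFTSNSLTSKETSCIAKCS"
         "EKFLKHSERVGQRFQEQ")

identical, length = alignment_identity(QUERY, SBJCT)
# Integer division mirrors the truncated percentage printed in the report.
print(f"Identities = {identical}/{length} ({100 * identical // length}%)")
```

Under these assumptions the sketch reproduces the "Identities = 63/77 (81%)" line of the alignment report.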

smORF181 (SEQ ID NO: 44):

Score = 192 (72.6 bits), Expect = 8.8e-15, P = 8.8e-15
Identities = 38/85 (44%), Positives = 56/85 (65%), Frame = +1

Query: 10 RQVLSLYKEFIKNANQFNNYNFREYFLSKTRTTFRKNMNQQDPKVL^4NLFKEAKNDLGVL 69
+Q+L LYK+ ++ A +F+NYNF+EY K TF+ N++ + +ENL+L

Sbjct: 4054 KQILLLYKQLLEKAYKFDNYNFKEYSKRKIVETFKANKSLTNENEINQFYNEGINQLALL 4233

Query: 70 KRQSVISQMYTFDRLWEPLQGRKH 94

RQ+ ISQ+YTFD+LWEPL +KH

Sbjct: 4234 YRQTTISQLYTFDKLWEPL--KKH 4302

smORF172 (SEQ ID NO: 43) was recently annotated as TIM9, and its gene product is
believed to be a translocase of the inner membrane of mitochondria involved in mitochondrial
protein import (Leuenberger D. et al. 1999. Different import pathways through the
mitochondrial intermembrane space for inner membrane proteins. EMBO J. 18: 4816-22).

smORF181 is also conserved among fungal species, thus implicating it as a target for
antifungal treatment.

v. Additional smORF Validation. 

To validate additional smORFs, the essentiality test was extended to 125
smORFs (Table 4) with the following results:



TABLE 4

SEQ ID    SEQ ID NO   SmORF No.   Essentiality Result
SC0013    13          smorf057    Confirmed essential
SC0034    34          smorf127    Possibly essential
SC0043    43          smorf172    Confirmed essential
SC0044    44          smorf181    Confirmed essential
SC0047    47          smorf207    Possibly essential
SC0052    52          smorf268    Possibly essential
SC0060    60          smorf303    Possibly essential
SC0068    68          smorf337    Possibly essential
SC0089    89          smorf532    Possibly essential
SC0104    104         smorf601    Possibly essential
SC0108    108         smorf626    Possibly essential
SC0111    111         smorf640    Possibly essential
SC0184    184         smorf117    Possibly essential
SC0190    190         smorf136    Possibly essential
SC0329    329         smorf330    Possibly essential
SC0334    334         smorf335    Possibly essential
SC0654    654         smorf520    Possibly essential
SC0572    572         smorf639    Possibly essential
SC0562    562         smorf623    Possibly essential



Three smORFs were determined to be essential (SEQ ID NOS: 13, 43 and 44).
Sixteen other sequences, which are listed in Table 4, were determined to encode
possibly essential proteins. The remaining sequences of the 125 analyzed were
determined to be non-essential. The C. albicans presumptive homolog of smORF57
(orf6.5842) was also disrupted, with the result that it is essential. The sixteen
S. cerevisiae smORFs that are potentially essential (SEQ ID NOS: 34, 47, 52, 60, 68,
89, 104, 108, 111, 184, 190, 329, 334, 654, 572, and 562) require confirmation of
essentiality by gene disruption in the diploid strain followed by sporulation and
tetrad analysis.
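
As a bookkeeping aid only, the results of Table 4 can be tallied programmatically. The following Python sketch assumes the table is transcribed as a mapping from SEQ ID NO to result; the names `TABLE_4` and `TOTAL_TESTED` are hypothetical, and the non-essential count is simply derived by subtraction from the 125 smORFs stated above.

```python
from collections import Counter

# SEQ ID NO -> essentiality result, transcribed from Table 4.
TABLE_4 = {
    13: "Confirmed essential", 43: "Confirmed essential", 44: "Confirmed essential",
    **{seq_id: "Possibly essential" for seq_id in (
        34, 47, 52, 60, 68, 89, 104, 108, 111, 184, 190, 329, 334, 654, 572, 562)},
}
TOTAL_TESTED = 125  # number of smORFs subjected to the essentiality test

counts = Counter(TABLE_4.values())
# smORFs tested but absent from Table 4 are the non-essential remainder.
non_essential = TOTAL_TESTED - len(TABLE_4)

for result, n in sorted(counts.items()):
    print(f"{result}: {n}")
print(f"Non-essential (by subtraction): {non_essential}")
```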

IV. Pharmaceutical Compositions 

Once essential genes are identified, compounds and compositions can be
screened for their ability to modulate the activity of the gene. For example, agents
can be screened against C. albicans essential genes to determine whether the compound has
antifungal properties. Essential genes of C. albicans, for example, that do not have
plant and/or mammalian homologs can be used as targets for the design and discovery
of highly specific antifungal agents. Also preferred would be the identification of
essential fungal and bacterial genes that have insect or plant homologs. Compounds
and compositions that target such genes could be used as insecticides and herbicides.
In another embodiment, essential genes which have mammalian homologs can be
used as targets for the design of anti-proliferative agents or agents which inhibit
proliferation or progression of the organism and/or its associated disease process.

Candidate agents which can be screened and eventually used to treat
conditions and diseases associated with organisms such as C. albicans encompass
numerous chemical classes, though typically they are organic molecules, preferably
small organic molecules having a molecular weight of more than 100 and less than
about 2,500 Daltons. Candidate agents are obtained from a wide variety of sources
including libraries of synthetic or natural compounds. They can include peptides,
macromolecules, small molecules, chemical and/or biological mixtures, and fungal,
bacterial, or algal extracts. Such compounds, or molecules, may be biological,
synthetic, organic, or even inorganic compounds, and may be obtained from several
sources, including pharmaceutical companies and specialty suppliers of libraries (e.g.,
combinatorial libraries) of compounds. Libraries can also include peptide libraries.

Methods of the present invention are well suited for screening libraries of
compounds in multiwell plates (e.g., 96-, 384-, or higher density well plates), with a
different test compound in each well. In particular, the methods may be employed
with combinatorial libraries. A variety of combinatorial libraries of random-sequence
oligonucleotides, polypeptides, or synthetic oligomers have been proposed. A
number of small-molecule libraries have also been developed.

Combinatorial libraries may be formed by a variety of solution-phase or solid-phase
methods in which mixtures of different subunits are added step-wise to growing
oligomers or parent compounds, until a desired compound is synthesized. A library
of increasing complexity can be formed in this manner, for example, by pooling
multiple choices of reagents with each additional subunit step. Methods of preparing
combinatorial libraries include, for example, the use of microwaving, dynamic
combinatorial chemistry (DCC), solid phase organic synthesis (SPOS), and dual
recursive deconvolution (DRED). See, e.g., Borman, "Combinatorial Chemistry",
Chem. Eng. News 49-58 (Aug. 27, 2001).
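
The growth in complexity that results from pooling multiple reagent choices at each subunit step can be illustrated by direct enumeration. The following Python sketch is a conceptual illustration only and does not model any particular synthesis chemistry; the subunit labels and the function name `enumerate_library` are hypothetical.

```python
from itertools import product
from math import prod


def enumerate_library(choices_per_step):
    """List every compound formed by picking one subunit at each step."""
    return ["-".join(combo) for combo in product(*choices_per_step)]


# Hypothetical reagent choices for three successive subunit-addition steps.
steps = [
    ["A1", "A2", "A3"],
    ["B1", "B2"],
    ["C1", "C2", "C3", "C4"],
]

library = enumerate_library(steps)
# Library size is the product of the number of choices at each step.
expected_size = prod(len(choices) for choices in steps)
print(len(library), expected_size)
print(library[:3])
```

Each additional step multiplies the library size by the number of reagent choices offered at that step, which is why a small number of steps can yield a large library.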

The identity of library compounds with desired effects on the target protein 
can be determined by conventional means, such as iterative synthesis methods in 
which sublibraries containing known residues in one subunit position only are 
identified as containing active compounds. 

Preferred compounds may have IC50 values between about 15 and about 50 µM
and preferably a low mammalian cellular toxicity (e.g., GI50 > 100 µM).
In the example of C. albicans, preferable compounds will have antifungal
activity of at least about 3-50 µM against C. albicans, as well as other fungal agents
associated with disease. Preferred antifungal agents will be those that are fungicidal,
i.e., which cause the selective death of the fungus. Preferred antibiotics will cause
the death of the fungal organism without detrimentally affecting the condition of the
host organism infected by the fungal organism (e.g., without causing cell death in the
host organism).
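
By way of a hedged illustration, the preferred potency and toxicity characteristics recited above can be expressed as a simple filter over screening results. The thresholds in the following Python sketch are taken from the text ("about 15" to "about 50" µM for IC50, GI50 above 100 µM) and are treated as exact cutoffs only for simplicity; the `Candidate` type, the function name and the example compounds are hypothetical and are not measured data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ic50_um: float  # potency against the target, in micromolar
    gi50_um: float  # mammalian cell growth inhibition, in micromolar


def meets_preferred_profile(c, ic50_range=(15.0, 50.0), min_gi50=100.0):
    """True if IC50 lies in the preferred range and GI50 exceeds the toxicity floor."""
    low, high = ic50_range
    return low <= c.ic50_um <= high and c.gi50_um > min_gi50


# Hypothetical screening hits, for illustration only.
hits = [
    Candidate("cmpd-A", ic50_um=20.0, gi50_um=150.0),
    Candidate("cmpd-B", ic50_um=5.0, gi50_um=150.0),
    Candidate("cmpd-C", ic50_um=30.0, gi50_um=80.0),
]
selected = [c.name for c in hits if meets_preferred_profile(c)]
print(selected)
```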

Generally, the preferred compositions and methods provided herein are
directed at preventing and treating infections caused by, but not limited to,
Chytridiomycetes, Hyphochytridiomycetes, Plasmodiophoromycetes, Oomycetes,
Zygomycetes, Ascomycetes, and Basidiomycetes. Fungal infections which can be
inhibited or treated with compositions provided herein include but are not limited to:
Candidiasis, including but not limited to onychomycosis, chronic mucocutaneous
candidiasis, oral candidiasis, epiglottitis, esophagitis, gastrointestinal infections, and
genitourinary infections, for example, caused by any Candida species, including but
not limited to Candida albicans, Candida tropicalis, Candida (Torulopsis) glabrata,
Candida parapsilosis, Candida lusitaniae, Candida rugosa and Candida
pseudotropicalis; Aspergillosis, including but not limited to granulocytopenia, caused,
for example, by Aspergillus spp. including but not limited to A. fumigatus,
Aspergillus flavus, Aspergillus niger and Aspergillus terreus; Zygomycosis, including
but not limited to pulmonary, sinus and rhinocerebral infections caused by, for
example, zygomycetes such as Mucor, Rhizopus spp., Absidia, Rhizomucor,
Cunninghamella, Saksenaea, Basidiobolus and Conidiobolus; Cryptococcosis, including
but not limited to infections of the central nervous system (meningitis) and
infections of the respiratory tract caused by, for example, Cryptococcus neoformans;
Trichosporonosis caused by, for example, Trichosporon beigelii; Pseudallescheriasis
caused by, for example, Pseudallescheria boydii; Fusarium infection caused by, for
example, Fusarium such as Fusarium solani, Fusarium moniliforme and Fusarium
proliferatum; and other infections such as those caused by, for example, Penicillium
spp. (generalized subcutaneous abscesses), Drechslera, Bipolaris, Exserohilum spp.,
Paecilomyces lilacinus, Exophiala jeanselmei (cutaneous nodules), Malassezia furfur
(folliculitis), Alternaria (cutaneous nodular lesions), Aureobasidium pullulans
(splenic and disseminated infection), Rhodotorula spp. (disseminated infection),
Chaetomium spp. (empyema), Torulopsis candida (fungemia), Curvularia spp.
(nasopharyngeal infection), Cunninghamella spp. (pneumonia), H. capsulatum, B.
dermatitidis, Coccidioides immitis, Sporothrix schenckii and Paracoccidioides
brasiliensis, and Geotrichum candidum (disseminated infection).

Treating "fungal infections" as used herein refers to the treatment of
conditions resulting from fungal infections. Therefore, contemplated is the treatment
of, for example, pneumonia, nasopharyngeal infections, disseminated infections and
other conditions listed above and known in the art by using the compositions
provided herein. In preferred embodiments, treatments and sanitization of areas with
the compositions provided herein can be used to treat immuno-compromised patients
or areas where there are such patients. Where it is desired to identify the particular
fungi resulting in the infection, techniques known in the art may be used.

One of skill in the art will readily appreciate that the methods described herein
also can be used for diagnostic applications. A diagnostic as used herein is a
compound or method that assists in the identification and characterization of a health
or disease state in humans or other animals, by a product of a gene identified by a
disclosed method. The genes and gene products thus identified are useful
tools in vitro for determining fungal infection.

V. Antisense Compositions and Use Thereof

In another embodiment, antisense compounds, compositions and methods are
provided for modulating the expression of genes identified by the above-described
methods. Preferable antisense compounds are those which target nucleic acids
identified using a systematic in silico discovery method disclosed herein. Preferred
antisense compounds can target, for example, SEQ ID NOS: 1-119 (See Table 2). Of
those, most preferred are agents that target essential genes such as smORF57 (SEQ ID
NO: 13).

It is preferred to target specific nucleic acids for antisense. "Targeting" an
antisense compound to a particular nucleic acid would preferably be to a nucleic acid
that encodes a protein, wherein the nucleic acid is one identified by a systematic in
silico process disclosed herein. The gene can be from a pathogenic organism. The
targeting includes determination of a site or sites within the target gene for the
antisense reaction (e.g., joinder of the sense and antisense strands to thereby modulate
function of the gene or gene transcript). Preferred antisense compounds are those that
recognize and bind with a site encompassing the translation initiation or termination
codon of the open reading frame (ORF) of the gene. Since, as is known in the art, the
translation initiation codon is typically 5'-AUG (in transcribed mRNA molecules;
5'-ATG in the corresponding DNA molecule), the translation initiation codon is also
referred to as the "AUG codon," the "start codon" or the "AUG start codon". A
minority of genes have a translation initiation codon having the RNA sequence
5'-GUG, 5'-UUG or 5'-CUG, and 5'-AUA, 5'-ACG and 5'-CUG have been shown to
function in vivo. Thus, the terms "translation initiation codon" and "start codon" can
encompass many codon sequences, even though the initiator amino acid in each
instance is typically methionine (in eukaryotes) or formylmethionine (in prokaryotes).

It is also known in the art that eukaryotic and prokaryotic genes may have two
or more alternative start codons, any one of which may be preferentially utilized for
translation initiation in a particular cell type or tissue, or under a particular set of
conditions. In the context of the invention, "start codon" and "translation initiation
codon" refer to the codon or codons that are used in vivo to initiate translation of an
mRNA molecule transcribed from a gene encoding a protein which was identified by
a systematic in silico method disclosed herein or one of the sequences disclosed
herein.

A translation termination codon (or "stop codon") of a gene's transcript may
have one of three sequences, i.e., 5'-UAA, 5'-UAG and 5'-UGA (the corresponding
DNA sequences are 5'-TAA, 5'-TAG and 5'-TGA, respectively). The terms "start
codon region" and "translation initiation codon region" refer to a portion of such an
mRNA or gene that encompasses from about 25 to about 50 contiguous nucleotides in
either direction (i.e., 5' or 3') from a translation initiation codon. Similarly, the terms
"stop codon region" and "translation termination codon region" refer to a portion of
such an mRNA or gene that encompasses from about 25 to about 50 contiguous
nucleotides in either direction (i.e., 5' or 3') from a translation termination codon.
Preferred antisense compositions would recognize and bind to areas containing a
termination codon and/or an initiation codon of any target gene or the mRNA
transcript it encodes.
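
The definition of a "start codon region" or "stop codon region" given above can be sketched as a window taken around a codon position. The following Python sketch is illustrative only: the function name `codon_region`, the `flank` parameter (chosen within the recited range of about 25 to about 50 nucleotides) and the example transcript are hypothetical, and the sketch assumes that the region includes the codon itself.

```python
def codon_region(mrna: str, codon_start: int, flank: int = 25) -> str:
    """Return the codon at codon_start plus up to `flank` nucleotides on each side."""
    if not 0 <= codon_start <= len(mrna) - 3:
        raise ValueError("codon_start must index a complete codon")
    left = max(0, codon_start - flank)                # clip at the 5' end
    right = min(len(mrna), codon_start + 3 + flank)   # clip at the 3' end
    return mrna[left:right]


# Hypothetical transcript: 30-nt leader, AUG start codon, 30-nt body.
mrna = "G" * 30 + "AUG" + "C" * 30
region = codon_region(mrna, mrna.find("AUG"), flank=25)
print(len(region), region[25:28])
```

The same function applies to a stop codon region when `codon_start` is set to the position of the termination codon.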

The open reading frame (ORF) or "coding region," which is known in the art
to refer to the region between the translation initiation codon and the translation
termination codon, is also a region which may be a preferred target of the antisense
compounds or compositions. Other target regions include the 5' untranslated region
(5'UTR), known in the art to refer to the portion of an mRNA in the 5' direction from
the translation initiation codon, and thus including nucleotides between the 5' cap site
and the translation initiation codon of an mRNA or corresponding nucleotides on the
gene, and the 3' untranslated region (3'UTR), known in the art to refer to the portion
of an mRNA in the 3' direction from the translation termination codon, and thus
including nucleotides between the translation termination codon and 3' end of an
mRNA or corresponding nucleotides on the gene. The 5' cap of an mRNA comprises
an N7-methylated guanosine residue joined to the 5'-most residue of the mRNA via a
5'-5' triphosphate linkage. The 5' cap region of an mRNA is considered to include
the 5' cap structure itself, and the first 50 nucleotides adjacent to the cap. The 5' cap
region may also be a preferred target region for an antisense compound or
composition.
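
The target regions described above (5'UTR, ORF and 3'UTR) can be illustrated by partitioning a transcript at its start and stop codons. The following Python sketch makes several simplifying assumptions that are not taken from the disclosure: it uses the first 5'-AUG as the start codon, it takes the first in-frame stop codon as the end of the ORF, and it counts both codons as part of the ORF. The function name `partition_mrna` and the example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def partition_mrna(mrna: str, start_codon: str = "AUG") -> dict:
    """Split an mRNA into 5'UTR, ORF (start through stop codon) and 3'UTR."""
    start = mrna.find(start_codon)
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon, in frame with the start codon, until a stop codon.
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            stop_end = i + 3
            return {
                "5'UTR": mrna[:start],
                "ORF": mrna[start:stop_end],
                "3'UTR": mrna[stop_end:],
            }
    raise ValueError("no in-frame stop codon found")


# Hypothetical short transcript for illustration.
regions = partition_mrna("GGCACC" + "AUGGCUUAA" + "UUUU")
print(regions)
```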

In the instance of more complex eukaryotic organisms, the genes are
composed of introns and exons, with the exons containing the material that will
encode the protein product of the gene. The intronic material, although transcribed
from the gene to produce the mRNA, will be excised from the mRNA transcript prior
to its translation into a protein. The exons are spliced together to form a continuous
mRNA sequence. The mRNA splice sites, i.e., intron-exon junctions, may also be
preferred target regions of antisense compounds and compositions, and are
particularly useful in situations where aberrant splicing is implicated in disease, or
where an overproduction of a particular mRNA splice product is implicated in
disease. Aberrant fusion junctions due to rearrangements or deletions are also
preferred targets. It has also been found that introns can be effective, and
therefore preferred, target regions for antisense compounds targeted, for example, to
DNA or pre-mRNA.

Once one or more target sites are identified in the genes identified using a
systematic discovery process disclosed herein, oligonucleotides are chosen which are
sufficiently complementary to the target, i.e., hybridize sufficiently well and with
sufficient specificity, to produce the desired biological outcome (e.g., inhibition
of microorganism proliferation or progression, inhibition and/or prevention of the
disease or condition induced by the microorganism, modulation of the activity of the
targeted gene).

In the context of this invention, "hybridization" means hydrogen bonding,
which may be Watson-Crick, Hoogsteen or reversed Hoogsteen hydrogen bonding,
between complementary nucleoside or nucleotide bases. For example, adenine (A)
and thymine (T) are complementary nucleobases, which pair through the formation of
hydrogen bonds. "Complementary," as used herein, refers to the capacity for precise
pairing between two nucleotides. For example, if a nucleotide at a certain position of
an oligonucleotide is capable of hydrogen bonding with a nucleotide at the same
position of a DNA or RNA molecule, then the oligonucleotide and the DNA or RNA
are considered to be complementary to each other at that position. The
oligonucleotide and the DNA or RNA are complementary to each other when a
sufficient number of corresponding positions in each molecule are occupied by
nucleotides which can hydrogen bond with each other. It is understood in the art that
the sequence of an antisense compound need not be 100% complementary to that of
its target nucleic acid to be specifically hybridizable. An antisense compound is
specifically hybridizable when binding of the compound to the target DNA or RNA
molecule interferes with the normal function of the target DNA or RNA to cause a
loss of utility, and there is a sufficient degree of complementarity to avoid
non-specific binding of the antisense compound or composition to non-target sequences
under conditions in which specific binding is desired. Preferred conditions for
specific binding are physiological conditions in the case of in vivo assays or
therapeutic treatment, and in the case of in vitro assays, under conditions in which the
assays are performed.
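
The position-by-position notion of complementarity described above can be illustrated with a short calculation. The following Python sketch is a simplification offered only as an example: it scores Watson-Crick pairs alone (Hoogsteen and reversed Hoogsteen pairing are not modeled), it assumes the oligonucleotide and the target site are of equal length and both written 5' to 3', and the function name and example sequences are hypothetical. It does not predict whether a compound is specifically hybridizable under any given conditions.

```python
# Watson-Crick pairs, allowing a DNA or RNA oligonucleotide against a DNA or RNA target.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def percent_complementarity(antisense: str, target: str) -> float:
    """Percentage of positions at which the antiparallel strands can base-pair."""
    if len(antisense) != len(target):
        raise ValueError("oligonucleotide and target site must be the same length")
    # Strands pair antiparallel: the 5' end of the oligo faces the 3' end of the target.
    paired = sum(
        (a, t) in WATSON_CRICK
        for a, t in zip(antisense.upper(), reversed(target.upper()))
    )
    return 100.0 * paired / len(target)


target_site = "AUGGCUAAC"    # hypothetical mRNA target site, 5' to 3'
perfect = "GTTAGCCAT"        # DNA oligonucleotide fully complementary to the site
one_mismatch = "GTTAGCCAA"   # same oligonucleotide with one mismatched 3' base

print(percent_complementarity(perfect, target_site))
print(round(percent_complementarity(one_mismatch, target_site), 1))
```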

Preferred antisense compounds and compositions contemplated would be for
use as research reagents and diagnostics. For example, antisense oligonucleotides,
which are able to inhibit gene expression, are often used by those of ordinary skill to
elucidate the function of particular genes. Antisense compounds and compositions
are also used, e.g., to distinguish between functions of various members of a
biological pathway. Antisense modulation has, therefore, been harnessed for research
use.

Oligonucleotides have been employed as therapeutic moieties in the treatment
of disease states in animals and man. It is thus established that oligonucleotides can
be useful therapeutic modalities that can be configured to be useful in treatment
regimes for treatment of cells, tissues and animals, especially humans. In the context
of this invention, the term "oligonucleotide" refers to an oligomer or polymer of
ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) or mimetics thereof. This
term includes oligonucleotides composed of naturally occurring nucleobases, sugars
and covalent internucleoside (backbone) linkages as well as oligonucleotides having
non-naturally-occurring portions which function similarly. Such modified or
substituted oligonucleotides are often preferred over native forms because of desirable
properties such as, e.g., enhanced cellular uptake, enhanced affinity for nucleic acid
target and increased stability in the presence of nucleases.

While antisense oligonucleotides are a preferred form of antisense compound,
the present invention comprehends other oligomeric antisense compounds, including
but not limited to oligonucleotide mimetics such as are described below. The
antisense compounds in accordance with this invention preferably comprise from
about 8 to about 30 nucleobases (i.e., from about 8 to about 30 linked nucleosides).
The antisense compounds can be longer than 30 (e.g., 35, 40, 45, 50, 55, 60, 65, 70,
75, 80, 85, 90, 95, 100 or more as well as ranges in between). However, more
preferred antisense compounds comprise from about 12 to about 25 nucleobases.
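
As a hedged illustration of how candidate antisense sequences of a chosen length might be enumerated against a target region, the following Python sketch tiles a target in one-nucleotide steps and writes the reverse complement of each window as a DNA oligonucleotide. The default length of 20 is merely one value within the more preferred range of about 12 to about 25 nucleobases. The function names and the example target are hypothetical, and the sketch does not rank candidates or account for backbone, sugar or base modifications.

```python
# Complement of each target base, written as DNA (target may be RNA or DNA).
_COMPLEMENT = {"A": "T", "U": "A", "T": "A", "G": "C", "C": "G"}


def antisense_dna(target: str) -> str:
    """Reverse complement of a target site, written 5' to 3' as DNA."""
    return "".join(_COMPLEMENT[base] for base in reversed(target.upper()))


def tile_antisense(target: str, length: int = 20):
    """Return (offset, oligonucleotide) pairs for every window of `length` in the target."""
    return [
        (i, antisense_dna(target[i:i + length]))
        for i in range(len(target) - length + 1)
    ]


# Hypothetical 30-nt target region beginning at a start codon.
target_region = "AUGGCUAACGGAUUCCGAUAGCUAGGCAUC"
candidates = tile_antisense(target_region, length=12)
print(len(candidates), candidates[0])
```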
As is known in the art, a nucleoside is a base-sugar combination. The base
portion of the nucleoside is normally a heterocyclic base. The two most common
classes of such heterocyclic bases are the purines and the pyrimidines. Nucleotides
are nucleosides that further include a phosphate group covalently linked to the sugar
portion of the nucleoside. For those nucleosides that include a pentofuranosyl sugar,
the phosphate group can be linked to either the 2', 3' or 5' hydroxyl moiety of the
sugar. In forming oligonucleotides, the phosphate groups covalently link adjacent
nucleosides to one another to form a linear polymeric compound. In turn, the
respective ends of this linear polymeric structure can be further joined to form a
circular structure. However, open linear structures are generally preferred for use as
antisense compounds or in antisense compositions. Within the oligonucleotide
structure, the phosphate groups are commonly referred to as forming the
internucleoside backbone of the oligonucleotide. The normal linkage or backbone of
RNA and DNA is a 3' to 5' phosphodiester linkage.

Specific examples of preferred antisense compounds useful in this invention
include oligonucleotides containing modified backbones or non-natural
internucleoside linkages. As defined in this specification, oligonucleotides having
modified backbones include those that retain a phosphorus atom in the backbone and
those that do not have a phosphorus atom in the backbone. For the purposes of this
specification, and as sometimes referenced in the art, modified oligonucleotides that
do not have a phosphorus atom in their internucleoside backbone can also be
considered to be oligonucleosides.

Preferred modified oligonucleotide backbones for use in antisense compounds
and compositions include, for example, phosphorothioates, chiral phosphorothioates,
phosphorodithioates, phosphotriesters, aminoalkylphosphotriesters, methyl and other
alkyl phosphonates including 3'-alkylene phosphonates and chiral phosphonates,
phosphinates, phosphoramidates including 3'-amino phosphoramidate and
aminoalkylphosphoramidates, thionophosphoramidates, thionoalkylphosphonates,
thionoalkylphosphotriesters, and boranophosphates having normal 3'-5' linkages, 2'-5'
linked analogs of these, and those having inverted polarity wherein the adjacent pairs
of nucleoside units are linked 3'-5' to 5'-3' or 2'-5' to 5'-2'. Various salts, mixed salts
and free acid forms are also included. For additional details on preparing such
phosphorus-containing linkages, see, for example, U.S. Pat. Nos.: 3,687,808;
4,469,863; 4,476,301; 5,023,243; 5,177,196; 5,188,897; 5,264,423; 5,276,019;
5,278,302; 5,286,717; 5,321,131; 5,399,676; 5,405,939; 5,453,496; 5,455,233;
5,466,677; 5,476,925; 5,519,126; 5,536,821; 5,541,306; 5,550,111; 5,563,253;
5,571,799; 5,587,361; and 5,625,050.

Preferred modified oligonucleotide backbones that do not include a
phosphorus atom may have backbones that are formed by short chain alkyl or
cycloalkyl internucleoside linkages, mixed heteroatom and alkyl or cycloalkyl
internucleoside linkages, or one or more short chain heteroatomic or heterocyclic
internucleoside linkages. These include those having morpholino linkages (formed in
part from the sugar portion of a nucleoside); siloxane backbones; sulfide, sulfoxide
and sulfone backbones; formacetyl and thioformacetyl backbones; methylene
formacetyl and thioformacetyl backbones; alkene containing backbones; sulfamate
backbones; methyleneimino and methylenehydrazino backbones; sulfonate and
sulfonamide backbones; amide backbones; and others having mixed N, O, S and CH2
component parts. For methods of preparing modified oligonucleotide backbones that
lack phosphorus atoms, see, e.g., U.S. Pat. Nos.: 5,034,506; 5,166,315; 5,185,444;
5,214,134; 5,216,141; 5,235,033; 5,264,562; 5,264,564; 5,405,938; 5,434,257;
5,466,677; 5,470,967; 5,489,677; 5,541,307; 5,561,225; 5,596,086; 5,602,240;
5,610,289; 5,608,046; 5,618,704; 5,623,070; 5,663,312;
5,633,360; 5,677,437; and 5,677,439.

Other preferred oligonucleotide mimetics include those in which both the
sugar and the internucleoside linkage, i.e., the backbone, of the nucleotide units are
replaced with novel groups. The base units are maintained for hybridization with an
appropriate nucleic acid target compound. One such oligomeric compound, an
oligonucleotide mimetic that has been shown to have excellent hybridization
properties, is referred to as a peptide nucleic acid (PNA). In PNA compounds, the
sugar-backbone of an oligonucleotide is replaced with an amide containing backbone,
in particular an aminoethylglycine backbone. The nucleobases are retained and are
bound directly or indirectly to aza nitrogen atoms of the amide portion of the
backbone. For discussion of such methods, see, for example, U.S. Pat. Nos.
5,539,082; 5,714,331; and 5,719,262 and Nielsen et al., Science, 1991, 254: 1497-1500.

Most preferred embodiments of the invention are oligonucleotides with
phosphorothioate backbones and oligonucleosides with heteroatom backbones, and in
particular -CH2-NH-O-CH2-, -CH2-N(CH3)-O-CH2- [known as a
methylene (methylimino) or MMI backbone], -CH2-O-N(CH3)-CH2-,
-CH2-N(CH3)-N(CH3)-CH2- and -O-N(CH3)-CH2-CH2- [wherein the
native phosphodiester backbone is represented as -O-P-O-CH2-] and amide
backbones such as those described in U.S. Pat. No. 5,602,240. Also preferred are
oligonucleotides having morpholino backbone structures, such as those described in
U.S. Pat. No. 5,034,506.

Modified oligonucleotides used as antisense compounds or in antisense
compositions as contemplated herein may also contain one or more substituted sugar
moieties. Preferred oligonucleotides comprise one of the following at the 2' position:
OH; F; O-, S-, or N-alkyl; O-, S-, or N-alkenyl; O-, S- or N-alkynyl; or
O-alkyl-O-alkyl, wherein the alkyl, alkenyl and alkynyl may be substituted or
unsubstituted C1 to C10 alkyl or C2 to C10 alkenyl and alkynyl. Particularly preferred
are O[(CH2)nO]mCH3, O(CH2)nOCH3, O(CH2)nNH2, O(CH2)nCH3, O(CH2)nONH2, and
O(CH2)nON[(CH2)nCH3]2, where n and m are from 1 to about 10. Other preferred
oligonucleotides may comprise one of the following at the 2' position: C1 to C10 lower
alkyl, substituted lower alkyl, alkaryl, aralkyl, O-alkaryl or O-aralkyl, SH, SCH3,
OCN, Cl, Br, CN, CF3, OCF3, SOCH3, SO2CH3, ONO2, NO2, N3, NH2,
heterocycloalkyl, heterocycloalkaryl, aminoalkylamino, polyalkylamino, substituted
silyl, an RNA cleaving group, a reporter group, an intercalator, a group for improving
the pharmacokinetic properties of an oligonucleotide, or a group for improving the
pharmacodynamic properties of an oligonucleotide, and other substituents having
similar properties. A preferred modification includes 2'-methoxyethoxy
(2'-O-CH2CH2OCH3, also known as 2'-O-(2-methoxyethyl) or 2'-MOE) (Martin et al., Helv.
Chim. Acta, 1995, 78: 486-504), i.e., an alkoxyalkoxy group. Another preferred
modification includes 2'-dimethylaminooxyethoxy (i.e., a O(CH2)2ON(CH3)2 group,
also known as 2'-DMAOE) and 2'-dimethylaminoethoxyethoxy (also known in the art
as 2'-O-dimethylaminoethoxyethyl or 2'-DMAEOE).

Other preferred modifications to the antisense compounds contemplated
include 2'-methoxy (2'-O-CH3), 2'-aminopropoxy (2'-OCH2CH2CH2NH2) and
2'-fluoro (2'-F). Similar modifications may also be made at other positions on the
oligonucleotide, particularly at the 3' position of the sugar on the 3' terminal
nucleotide or in 2'-5' linked oligonucleotides and the 5' position of 5' terminal
nucleotide. Oligonucleotides may also have sugar mimetics, such as cyclobutyl
moieties in place of the pentofuranosyl sugar. For methods of preparing such
modified sugar structures, see, for example, U.S. Pat. Nos.: 4,981,957; 5,118,800;
5,319,080; 5,359,044; 5,393,878; 5,446,137; 5,466,786; 5,514,785; 5,519,134;
5,567,811; 5,576,427; 5,591,722; 5,597,909; 5,610,300; 5,627,053; 5,639,873;
5,646,265; 5,658,873; 5,670,633; and 5,700,920.

Oligonucleotides may also include nucleobase (often referred to in the art
simply as "base") modifications or substitutions. As used herein, "unmodified" or
"natural" nucleobases include the purine bases adenine (A) and guanine (G), and the
pyrimidine bases thymine (T), cytosine (C) and uracil (U). The invention also
contemplates the use of modified nucleobases in the antisense compounds and
compositions. Such modified nucleobases include other synthetic and natural
nucleobases, such as 5-methylcytosine (5-me-C), 5-hydroxymethyl cytosine,
xanthine, hypoxanthine, 2-aminoadenine, 6-methyl and other alkyl derivatives of
adenine and guanine, 2-propyl and other alkyl derivatives of adenine and guanine,
2-thiouracil, 2-thiothymine and 2-thiocytosine, 5-halouracil and cytosine, 5-propynyl
uracil and cytosine, 6-azo uracil, cytosine and thymine, 5-uracil (pseudouracil),
4-thiouracil, 8-halo, 8-amino, 8-thiol, 8-thioalkyl, 8-hydroxyl and other 8-substituted
adenines and guanines, 5-halo (e.g., particularly 5-bromo, 5-trifluoromethyl) and
other 5-substituted uracils and cytosines, 7-methylguanine and 7-methyladenine,
8-azaguanine and 8-azaadenine, 7-deazaguanine and 7-deazaadenine and
3-deazaguanine and 3-deazaadenine. Additional nucleobases would be known to the
skilled artisan. See, for example, U.S. Pat. No. 3,687,808; The Concise
Encyclopedia of Polymer Science and Engineering, 858-859 (Kroschwitz, J. I.,
ed. John Wiley & Sons, 1990); Englisch et al., Angewandte Chemie, v.30, p. 613
(International Edition, 1991); and Sanghvi, Y. S., Chapter 15, Antisense Research
and Applications, 289-302 (Crooke et al., CRC Press, 1993). Certain of these
nucleobases are particularly useful for increasing the binding affinity of the
oligomeric compounds of the invention. These include 5-substituted pyrimidines,
6-azapyrimidines and N-2, N-6 and O-6 substituted purines, including
2-aminopropyladenine, 5-propynyluracil and 5-propynylcytosine. 5-methylcytosine
substitutions have been shown to increase nucleic acid duplex stability by 0.6-1.2°C
(Sanghvi, Y. S., et al., 1993) and are presently preferred base substitutions, even more
particularly when combined with 2'-O-methoxyethyl sugar modifications.

Another oligonucleotide modification contemplated for use in the antisense
compounds and compositions involves chemically linking to the oligonucleotide one
or more moieties or conjugates that enhance the activity, cellular distribution or
cellular uptake of the oligonucleotide. Such moieties include but are not limited to
lipid moieties such as a cholesterol moiety (Letsinger et al., Proc. Natl. Acad. Sci.
USA, 1989, 86: 6553-6), cholic acid (Manoharan et al., Bioorg. Med. Chem. Lett.,
1994, 4: 1053-60), a thioether, e.g., hexyl-S-tritylthiol (Manoharan et al., Ann. N.Y.
Acad. Sci., 1992, 660: 306-9; and Manoharan et al., Bioorg. Med. Chem. Lett., 1993,
3: 2765-70), a thiocholesterol (Oberhauser et al., Nucl. Acids Res., 1992, 20: 533-8),
an aliphatic chain, e.g., dodecandiol or undecyl residues (Saison-Behmoaras et al.,
EMBO J., 1991, 10: 1111-8; Kabanov et al., FEBS Lett., 1990, 259: 327-30; and
Svinarchuk et al., Biochimie, 1993, 75: 49-54), a phospholipid, e.g.,
di-hexadecyl-rac-glycerol or triethyl-ammonium
1,2-di-O-hexadecyl-rac-glycero-3-H-phosphonate
(Manoharan et al., Tetrahedron Lett., 1995, 36: 3651-4; and Shea et al., Nucl. Acids
Res., 1990, 18: 3777-83), a polyamine or a polyethylene glycol chain (Manoharan et
al., Nucleosides & Nucleotides, 1995, 14: 969-73), or adamantane acetic acid
(Manoharan et al., Tetrahedron Lett., 1995, 36: 3651-4), a palmityl moiety (Mishra et
al., Biochim. Biophys. Acta, 1995, 1264: 229-237), or an octadecylamine or
hexylamino-carbonyl-oxy cholesterol moiety (Crooke et al., J. Pharmacol. Exp. Ther.,
1996, 277: 923-937).

Methods for preparing such oligonucleotide conjugates would be known in the
art and include but are not limited to U.S. Pat. Nos.: 4,828,979; 4,948,882; 5,218,105;
5,525,465; 5,541,313; 5,545,730; 5,552,538; 5,578,717; 5,580,731; 5,580,731;
5,591,584; 5,109,124; 5,118,802; 5,138,045; 5,414,077; 5,486,603; 5,512,439;
5,578,718; 5,608,046; 4,587,044; 4,605,735; 4,667,025; 4,762,779; 4,789,737;
4,824,941; 4,835,263; 4,876,335; 4,904,582; 4,958,013; 5,082,830; 5,112,963;
5,214,136; 5,082,830; 5,112,963; 5,214,136; 5,245,022; 5,254,469; 5,258,506;
5,262,536; 5,272,250; 5,292,873; 5,317,098; 5,371,241; 5,391,723; 5,416,203;
5,451,463; 5,510,475; 5,512,667; 5,514,785; 5,565,552; 5,567,810; 5,574,142;
5,585,481; 5,587,371; 5,595,726; 5,597,696; 5,599,923; 5,599,928 and 5,688,941.

One or more of the positions in a given compound can be modified. It is not
necessary for all positions in a given compound to be uniformly modified, and in fact
more than one of the aforementioned modifications may be incorporated in a single
compound or even at a single nucleoside within an oligonucleotide.

The present invention also includes antisense compounds that are chimeric
compounds. "Chimeric" antisense compounds or "chimeras," in the context of this
invention, are antisense compounds, particularly oligonucleotides, which contain two
or more chemically distinct regions, each made up of at least one monomer unit, i.e., a
nucleotide in the case of an oligonucleotide compound. These oligonucleotides
typically contain at least one region wherein the oligonucleotide is modified so as to
confer upon the oligonucleotide increased resistance to nuclease degradation,
increased cellular uptake, and/or increased binding affinity for the target nucleic acid.
An additional region of the oligonucleotide may serve as a substrate for enzymes
capable of cleaving RNA:DNA or RNA:RNA hybrids. By way of example, RNase H
is a cellular endonuclease that cleaves the RNA strand of an RNA:DNA duplex.
Activation of RNase H, therefore, results in cleavage of the RNA target, thereby
greatly enhancing the efficiency of oligonucleotide inhibition of gene expression.
Consequently, comparable results can often be obtained with shorter oligonucleotides
when chimeric oligonucleotides are used, compared to phosphorothioate
deoxyoligonucleotides hybridizing to the same target region. Cleavage of the RNA
target can be routinely detected by gel electrophoresis and, if necessary, associated
nucleic acid hybridization techniques known in the art.
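The definition of a chimeric compound given above can be made concrete with a
short sketch. The following Python fragment is illustrative only and is not part of
the disclosed methods; the names Region, is_chimeric and substrate_regions, the
chemistry labels "2'-MOE" and "DNA", and the assumption that a region labeled
"DNA" is the one that serves as an enzyme substrate are all hypothetical choices
made for this example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Region:
    """One contiguous stretch of an oligonucleotide (hypothetical model)."""
    chemistry: str  # illustrative label, e.g. "2'-MOE" or "DNA"
    length: int     # number of monomer units (nucleotides) in the region


def is_chimeric(regions: List[Region]) -> bool:
    """Return True if the compound has two or more chemically distinct
    regions, each made up of at least one monomer unit."""
    if any(region.length < 1 for region in regions):
        return False
    return len({region.chemistry for region in regions}) >= 2


def substrate_regions(regions: List[Region],
                      substrate_chemistries: Set[str]) -> List[Region]:
    """Return the regions whose chemistry is assumed (by the caller) to
    serve as a substrate for a hybrid-cleaving enzyme such as RNase H."""
    return [r for r in regions if r.chemistry in substrate_chemistries]


# A hypothetical three-region compound: modified flanks around a central gap.
gapmer = [Region("2'-MOE", 5), Region("DNA", 10), Region("2'-MOE", 5)]
uniform = [Region("DNA", 20)]

print(is_chimeric(gapmer))    # two distinct chemistries are present
print(is_chimeric(uniform))   # only one chemistry is present
print(substrate_regions(gapmer, {"DNA"}))
```

The check deliberately tests only the two conditions stated in the definition;
it says nothing about whether a particular arrangement of regions is effective.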

Chimeric antisense compounds of the invention may be formed as composite
structures of two or more oligonucleotides, modified oligonucleotides,
oligonucleosides and/or oligonucleotide mimetics as described above. Such
compounds are also known as hybrids or gapmers. Methods of preparing such
hybrids include but are not limited to the teachings of U.S. Pat. Nos.: 5,013,830;
5,149,797; 5,220,007; 5,256,775; 5,366,878; 5,403,711; 5,491,133; 5,565,350;
5,623,065; 5,652,355; 5,652,356; and 5,700,922.

The antisense compounds contemplated herein may be conveniently and
routinely made through the well-known technique of solid phase synthesis. The
oligonucleotides can be prepared for example using the equipment and techniques of
Applied Biosystems. Any other means for such synthesis known in the art may
additionally or alternatively be employed.

The antisense compounds of the invention are synthesized in vitro and do not
include antisense compositions of biological origin, or genetic vector constructs
designed to direct the in vivo synthesis of antisense molecules. The compounds of the
invention may also be admixed, encapsulated, conjugated or otherwise associated
with other molecules, molecule structures or mixtures of compounds, as for example,
liposomes, receptor targeted molecules, oral, rectal, topical or other formulations, for
assisting in uptake, distribution and/or absorption. Methods and preparations for such
uptake, distribution and/or absorption assisting formulations include, but are not
limited to, U.S. Pat. Nos.: 5,108,921; 5,354,844; 5,416,016; 5,459,127; 5,521,291;
5,543,158; 5,547,932; 5,583,020; 5,591,721; 4,426,330; 4,534,899; 5,013,556;
5,108,921; 5,213,804; 5,227,170; 5,264,221; 5,356,633; 5,395,619; 5,416,016;
5,417,978; 5,462,854; 5,469,854; 5,512,295; 5,527,528; 5,534,259; 5,543,152;
5,556,948; 5,580,575; and 5,595,756.

The contemplated antisense compounds and compositions disclosed herein
also include any pharmaceutically acceptable salts, esters, or salts of such esters, or
any other compound which, upon administration to an animal including a human, is
capable of providing (directly or indirectly) the biologically active metabolite or
residue thereof. Accordingly, for example, the disclosure is also drawn to prodrugs
and pharmaceutically acceptable salts of the compounds of the invention,
pharmaceutically acceptable salts of such prodrugs, and other bioequivalents.

The term "prodrug" indicates a therapeutic agent that is prepared in an inactive
form that is converted to an active form (i.e., drug) within the body or cells thereof by
the action of endogenous enzymes or other chemicals and/or conditions. In particular,
prodrug versions of the oligonucleotides of the invention are prepared as SATE
[(S-acetyl-2-thioethyl) phosphate] derivatives according to the methods disclosed for
example in WO 93/24510 and in WO 94/26764.

The term "pharmaceutically acceptable salts" refers to physiologically and
pharmaceutically acceptable salts of the compounds of the invention: i.e., salts that
retain the desired biological activity of the parent compound and do not impart
undesired toxicological effects thereto. The compounds for modulating any of the
disclosed genes, gene transcripts or proteins encoded thereby include antisense
compounds as well as other modulatory compounds.

Pharmaceutically acceptable base addition salts for use with antisense as well
as other modulatory compounds are formed with metals or amines, such as alkali and
alkaline earth metals or organic amines. Examples of metals used as cations are
sodium, potassium, magnesium, calcium, and the like. Examples of suitable amines
are N,N'-dibenzylethylenediamine, chloroprocaine, choline, diethanolamine,
dicyclohexylamine, ethylenediamine, N-methylglucamine, and procaine (see, e.g.,
Berge et al., "Pharmaceutical Salts," J. Pharma. Sci., 1977, 66: 1-19). The base
addition salts of acidic compounds are prepared by contacting the free acid form with
a sufficient amount of the desired base to produce the salt in the conventional manner.
The free acid form may be regenerated by contacting the salt form with an acid, and
isolating the free acid in a conventional manner. The free acid forms differ from their
respective salt forms somewhat in certain physical properties such as solubility in
polar solvents, but otherwise the salts are equivalent to their respective free acid for
purposes of the present invention. As used herein, a "pharmaceutical addition salt"
includes a pharmaceutically acceptable salt of an acid form of one of the components
of the compositions of the invention. These include organic or inorganic acid salts of
the amines. Preferred acid salts are the hydrochlorides, acetates, salicylates, nitrates
and phosphates. Other suitable pharmaceutically acceptable salts are known in the art
and include basic salts of a variety of inorganic and organic acids, such as, for
example, with inorganic acids (e.g., hydrochloric acid, hydrobromic acid, sulfuric
acid or phosphoric acid); with organic carboxylic, sulfonic, sulfo or phospho acids or
N-substituted sulfamic acids, for example acetic acid, propionic acid, glycolic acid,
succinic acid, maleic acid, hydroxymaleic acid, methylmaleic acid, fumaric acid,
malic acid, tartaric acid, lactic acid, oxalic acid, gluconic acid, glucaric acid,
glucuronic acid, citric acid, benzoic acid, cinnamic acid, mandelic acid, salicylic acid,
4-aminosalicylic acid, 2-phenoxybenzoic acid, 2-acetoxybenzoic acid, embonic acid,
nicotinic acid or isonicotinic acid; and with amino acids, such as the 20 alpha-amino
acids involved in the synthesis of proteins in nature, for example glutamic acid or
aspartic acid, and also with phenylacetic acid, methanesulfonic acid, ethanesulfonic
acid, 2-hydroxyethanesulfonic acid, ethane-1,2-disulfonic acid, benzenesulfonic acid,
4-methylbenzenesulfonic acid, naphthalene-2-sulfonic acid,
naphthalene-1,5-disulfonic acid, 2- or 3-phosphoglycerate, glucose-6-phosphate,
N-cyclohexylsulfamic acid (with the formation of cyclamates), or with other acid
organic compounds, such as ascorbic acid.

Pharmaceutically acceptable salts of compounds may also be prepared with a
pharmaceutically acceptable cation. Suitable pharmaceutically acceptable cations are
well known in the art and include alkaline, alkaline earth, ammonium and quaternary
ammonium cations. Carbonates or hydrogen carbonates are also possible.

For oligonucleotides, preferred examples of pharmaceutically acceptable salts
include but are not limited to (a) salts formed with cations such as sodium, potassium,
ammonium, magnesium, calcium, polyamines such as spermine and spermidine, etc.;
(b) acid addition salts formed with inorganic acids, for example hydrochloric acid,
hydrobromic acid, sulfuric acid, phosphoric acid, nitric acid and the like; (c) salts
formed with organic acids such as, for example, acetic acid, oxalic acid, tartaric acid,
succinic acid, maleic acid, fumaric acid, gluconic acid, citric acid, malic acid,
ascorbic acid, benzoic acid, tannic acid, palmitic acid, alginic acid, polyglutamic acid,
naphthalenesulfonic acid, methanesulfonic acid, p-toluenesulfonic acid,
naphthalenedisulfonic acid, polygalacturonic acid, and the like; and (d) salts formed
from elemental anions such as chlorine, bromine, and iodine.

The antisense compounds and other modulatory compounds described herein
can be utilized in pharmaceutical compositions by adding an effective amount of an
antisense compound or other modulatory compound to a suitable pharmaceutically
acceptable diluent or carrier. The compounds and methods of the invention
may also be useful prophylactically, e.g., to prevent or delay infection, progression of
the microorganism, or inflammation.

The antisense compounds of the invention are useful for research and
diagnostics, because these compounds hybridize to nucleic acids encoding a gene
identified using the systematic discovery technique or an mRNA transcript thereof.
Such hybridization allows sandwich and other assays to be easily
constructed to exploit this fact. Hybridization of the antisense oligonucleotides of the
invention with a nucleic acid encoding a gene or gene transcript identified by a
systematic discovery method can be detected by means known in the art. Such means
may include conjugation of an enzyme to the oligonucleotide, radiolabelling of the
oligonucleotide or any other suitable detection means. Kits using such detection
means for detecting the level of a transcript of a gene in a sample may also be
prepared.

The present invention also includes pharmaceutical compositions and
formulations that include the antisense compounds and other modulatory compounds
and compositions of the invention. The pharmaceutical compositions of the present
invention may be administered in a number of ways depending upon whether local or
systemic treatment is desired and upon the area to be treated. Administration may be
topical (including ophthalmic and to mucous membranes including vaginal and rectal
delivery), pulmonary (e.g., by inhalation or insufflation of powders or aerosols,
including by nebulizer), intratracheal, intranasal, epidermal and transdermal, oral or
parenteral. Parenteral administration includes intravenous (i.v.), intraarterial,
subcutaneous (s.c.), intraperitoneal (i.p.) or intramuscular (i.m.) injection or infusion;
or intracranial (e.g., intrathecal or intraventricular) administration. Oligonucleotides
with at least one 2'-O-methoxyethyl modification are believed to be particularly
useful for oral administration.

Pharmaceutical compositions and formulations for topical administration may
include transdermal patches, ointments, lotions, creams, gels, drops, suppositories,
sprays, liquids and powders. Conventional pharmaceutical carriers, aqueous, powder
or oily bases, thickeners and the like may be necessary or desirable. Coated condoms,
gloves and the like may also be useful.

Compositions and formulations for oral administration include powders or 
granules, suspensions or solutions in water or non-aqueous media, capsules, sachets 
or tablets. Thickeners, flavoring agents, diluents, emulsifiers, dispersing aids or 
binders may be desirable. 

Compositions and formulations for parenteral, intrathecal or intraventricular
administration may include sterile aqueous solutions that may also contain buffers,
diluents and other suitable additives such as, but not limited to, penetration enhancers,
carrier compounds and other pharmaceutically acceptable carriers or excipients.

Pharmaceutical compositions (e.g., gene, gene transcript or protein product
modulatory agents as described herein) of the present invention include, but are not
limited to, solutions, emulsions, and liposome-containing formulations. These
compositions may be generated from a variety of components that include, but are not
limited to, preformed liquids, self-emulsifying solids and self-emulsifying semisolids.

The pharmaceutical formulations of the present invention, which may
conveniently be presented in unit dosage form, may be prepared according to
conventional techniques well known in the pharmaceutical industry. Such techniques
include the step of bringing into association the active ingredients with the
pharmaceutical carrier(s) or excipient(s). In general, the formulations are prepared by
uniformly and intimately bringing into association the active ingredients with liquid
carriers or finely divided solid carriers or both, and then, if necessary, shaping the
product.

The compositions of the present invention may be formulated into any of 
many possible dosage forms such as, but not limited to, tablets, capsules, liquid 
syrups, soft gels, suppositories, and enemas. The compositions of the present
invention may also be formulated as suspensions in aqueous, non-aqueous or mixed
media. Aqueous suspensions may further contain substances that increase the 
viscosity of the suspension including, for example, sodium carboxymethylcellulose, 
sorbitol and/or dextran. The suspension may also contain stabilizers. 

In one embodiment of the present invention, the pharmaceutical compositions
may be formulated and used as foams. Pharmaceutical foams include formulations
such as, but not limited to, emulsions, microemulsions, creams, jellies and liposomes.
While basically similar in nature, these formulations vary in the components and the
consistency of the final product. The preparation of such compositions and
formulations is generally known to those skilled in the pharmaceutical and
formulation arts and may be applied to the formulation of the compositions of the
present invention.

The compositions of the present invention may be prepared and formulated as
emulsions. Emulsions are typically heterogeneous systems of one liquid dispersed in
another in the form of droplets usually exceeding 0.1 µm in diameter. See, e.g.,
Idson, in Pharmaceutical Dosage Forms, v. 1, p. 199 (Lieberman, Rieger and
Banker (Eds.), 1988, Marcel Dekker, Inc., New York); Rosoff, in Pharmaceutical
Dosage Forms, v. 1, p. 245; Block, in Pharmaceutical Dosage Forms, v. 2, p.
335; Higuchi et al., in Remington's Pharmaceutical Sciences 301 (Mack
Publishing Co., Easton, Pa., 1985). Emulsions are often biphasic systems comprising
two immiscible liquid phases intimately mixed and dispersed with each other. In
general, emulsions may be either of the water-in-oil (w/o) or of the oil-in-water (o/w)
variety. When an aqueous phase is finely divided into and dispersed as minute
droplets into a bulk oily phase, the resulting composition is called a water-in-oil (w/o)
emulsion. Alternatively, when an oily phase is finely divided into and dispersed as
minute droplets into a bulk aqueous phase, the resulting composition is called an
oil-in-water (o/w) emulsion. Emulsions may contain components in addition
to the dispersed phases and the active drug, which may be present as a solution in either
the aqueous phase or the oily phase, or itself as a separate phase. Pharmaceutical excipients
such as emulsifiers, stabilizers, dyes, and anti-oxidants may also be present in
emulsions as needed. Pharmaceutical emulsions may also be multiple emulsions that
are comprised of more than two phases such as, for example, in the case of
oil-in-water-in-oil (o/w/o) and water-in-oil-in-water (w/o/w) emulsions. Such complex
formulations often provide certain advantages that simple binary emulsions do not.
Multiple emulsions in which individual oil droplets of an o/w emulsion enclose small
water droplets constitute a w/o/w emulsion. Likewise a system of oil droplets
enclosed in globules of water stabilized in an oily continuous phase provides an o/w/o
emulsion.
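The naming convention for simple and multiple emulsions described above can be
sketched in a few lines of Python. This fragment is illustrative only; the function
name emulsion_type and the convention of listing phases from the innermost
dispersed phase to the outermost continuous phase are assumptions made for this
example.

```python
from typing import Sequence

# Abbreviations used in the text: "w" for the aqueous phase, "o" for the oily phase.
ABBREVIATIONS = {"water": "w", "oil": "o"}


def emulsion_type(phases: Sequence[str]) -> str:
    """Name an emulsion from its phases, listed from the innermost dispersed
    phase to the outermost continuous phase, e.g. ["water", "oil"] -> "w/o"."""
    if len(phases) < 2:
        raise ValueError("an emulsion needs at least two phases")
    for phase in phases:
        if phase not in ABBREVIATIONS:
            raise ValueError(f"unknown phase: {phase!r}")
    # Each dispersed phase must differ from the phase that encloses it.
    for inner, outer in zip(phases, phases[1:]):
        if inner == outer:
            raise ValueError("adjacent phases must be immiscible")
    return "/".join(ABBREVIATIONS[phase] for phase in phases)


print(emulsion_type(["water", "oil"]))           # aqueous droplets in bulk oil
print(emulsion_type(["oil", "water"]))           # oily droplets in bulk water
print(emulsion_type(["water", "oil", "water"]))  # water in oil droplets in water
print(emulsion_type(["oil", "water", "oil"]))    # oil in water globules in oil
```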

Emulsions are characterized by little or no thermodynamic stability. Often, the
dispersed or discontinuous phase of the emulsion is well dispersed into the external or
continuous phase and maintained in this form through the means of emulsifiers or the
viscosity of the formulation. Either of the phases of the emulsion may be a semisolid
or a solid, as is the case of emulsion-style ointment bases and creams. Other means of
stabilizing emulsions entail the use of emulsifiers that may be incorporated into either
phase of the emulsion. Emulsifiers may broadly be classified into four categories:
synthetic surfactants, naturally occurring emulsifiers, absorption bases, and finely
dispersed solids (Idson, in Pharmaceutical Dosage Forms, v. 1, p. 199
(Lieberman, Rieger and Banker (Eds.), 1988, Marcel Dekker, Inc., New York)).

Synthetic surfactants, also known as surface active agents, have found wide
applicability in the formulation of emulsions and have been reviewed in the literature
(Rieger, in Pharmaceutical Dosage Forms, v. 1, p. 285; Idson, in
Pharmaceutical Dosage Forms, v. 1, p. 199). Surfactants are typically
amphiphilic and comprise a hydrophilic and a hydrophobic portion. The ratio of the
hydrophilic to the hydrophobic nature of the surfactant has been termed the
hydrophile/lipophile balance (HLB) and is a valuable tool in categorizing and
selecting surfactants in the preparation of formulations. Surfactants may be classified
into different classes based on the nature of the hydrophilic group: nonionic, anionic,
cationic and amphoteric (Rieger, in Pharmaceutical Dosage Forms).
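By way of illustration only, one widely used way to put a number on the HLB of a
nonionic surfactant is Griffin's method, which is not described in this disclosure
and is introduced here as an assumption: HLB = 20 x (molecular mass of the
hydrophilic portion) / (molecular mass of the whole molecule). The HLB ranges used
below (roughly 3 to 6 for w/o emulsifiers and 8 to 18 for o/w emulsifiers) are
commonly cited rules of thumb rather than limits taken from this document, and the
function names are hypothetical.

```python
from typing import Optional


def griffin_hlb(hydrophilic_mass: float, total_mass: float) -> float:
    """Griffin's estimate of HLB for a nonionic surfactant (0 to 20 scale)."""
    if total_mass <= 0 or not 0 <= hydrophilic_mass <= total_mass:
        raise ValueError("masses must satisfy 0 <= hydrophilic_mass <= total_mass, total_mass > 0")
    return 20.0 * hydrophilic_mass / total_mass


def suggested_emulsion(hlb: float) -> Optional[str]:
    """Map an HLB value to the emulsion type it is commonly said to favor.
    The cut-offs are rule-of-thumb assumptions, not values from this text."""
    if 3.0 <= hlb <= 6.0:
        return "w/o"
    if 8.0 <= hlb <= 18.0:
        return "o/w"
    return None  # outside the usual emulsifier ranges


# Hypothetical surfactants, described only by their masses in g/mol.
lipophilic = griffin_hlb(hydrophilic_mass=100.0, total_mass=500.0)
hydrophilic = griffin_hlb(hydrophilic_mass=440.0, total_mass=880.0)
print(lipophilic, suggested_emulsion(lipophilic))
print(hydrophilic, suggested_emulsion(hydrophilic))
```

A mostly lipophilic molecule yields a low HLB and is suggested for w/o systems,
while a molecule whose mass is half hydrophilic yields a mid-scale HLB and is
suggested for o/w systems.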

Naturally occurring emulsifiers used in emulsion formulations include lanolin,
beeswax, phosphatides, lecithin and acacia. Absorption bases possess hydrophilic
properties such that they can soak up water to form w/o emulsions yet retain their
semisolid consistencies, such as anhydrous lanolin and hydrophilic petrolatum.
Finely divided solids have also been used as good emulsifiers, especially in
combination with surfactants and in viscous preparations. These include polar
inorganic solids, such as heavy metal hydroxides, non-swelling clays (e.g., bentonite,
attapulgite, hectorite, kaolin, montmorillonite, colloidal aluminum silicate and
colloidal magnesium aluminum silicate), pigments and nonpolar solids (e.g., carbon
or glyceryl tristearate).

A large variety of non-emulsifying materials are also included in emulsion
formulations and contribute to the properties of emulsions. These include fats, oils,
waxes, fatty acids, fatty alcohols, fatty esters, humectants, hydrophilic colloids,
preservatives and antioxidants (Block, in Pharmaceutical Dosage Forms, v. 1,
p. 385 (Lieberman, Rieger and Banker (Eds.), 1988, Marcel Dekker, Inc., New York)).

Hydrophilic colloids or hydrocolloids include naturally occurring gums and 
synthetic polymers, such as polysaccharides (e.g., acacia, agar, alginic acid, 
carrageenan, guar gum, karaya gum, and tragacanth), cellulose derivatives (e.g., 
carboxymethylcellulose and carboxypropylcellulose), and synthetic polymers (e.g.,
carbomers, cellulose ethers, and carboxyvinyl polymers). These disperse or swell in
water to form colloidal solutions that stabilize emulsions by forming strong interfacial 
films around the dispersed-phase droplets and by increasing the viscosity of the 
external phase. 

Since emulsions often contain a number of ingredients such as carbohydrates,
proteins, sterols and phosphatides that may readily support the growth of microbes,
these formulations often incorporate preservatives. Preservatives commonly
used in emulsion formulations include methyl paraben, propyl paraben,
quaternary ammonium salts, benzalkonium chloride, esters of p-hydroxybenzoic acid,
and boric acid. Antioxidants are also commonly added to emulsion formulations to
prevent deterioration of the formulation. Antioxidants used may be free radical
scavengers (e.g., tocopherols, alkyl gallates, butylated hydroxyanisole, butylated
hydroxytoluene), reducing agents (e.g., ascorbic acid and sodium metabisulfite),
or antioxidant synergists (e.g., citric acid, tartaric acid, and lecithin).

The application of emulsion formulations via dermatological, oral and
parenteral routes and methods for their manufacture have been reviewed in the
literature (Idson, in Pharmaceutical Dosage Forms, v. 1, p. 199). Emulsion
formulations for oral delivery have been very widely used because of their ease
of formulation and their efficacy from an absorption and bioavailability standpoint (Rosoff, in
Pharmaceutical Dosage Forms, v. 1, p. 245 (Lieberman, Rieger and Banker
(Eds.), 1988, Marcel Dekker, Inc., New York); Idson, in Pharmaceutical Dosage
Forms). Mineral-oil base laxatives, oil-soluble vitamins and high fat nutritive
preparations are among the materials that have commonly been administered orally as
o/w emulsions.

In one embodiment of the present invention, the compositions of
oligonucleotides and nucleic acids are formulated as microemulsions. A
microemulsion may be defined as a system of water, oil and amphiphile which is a
single optically isotropic and thermodynamically stable liquid solution (Rosoff, in
Pharmaceutical Dosage Forms, v. 1, p. 245). Typically microemulsions are
systems that are prepared by first dispersing an oil in an aqueous surfactant solution
and then adding a sufficient amount of a fourth component, generally an intermediate
chain-length alcohol, to form a transparent system. Therefore, microemulsions have
also been described as thermodynamically stable, isotropically clear dispersions of
two immiscible liquids that are stabilized by interfacial films of surface-active
molecules (Leung and Shah, in Controlled Release of Drugs: Polymers and
Aggregate Systems, 185-215 (Rosoff, M., Ed., 1989, VCH Publishers, New York)).
Microemulsions commonly are prepared via a combination of three to five
components that include oil, water, surfactant, cosurfactant and electrolyte. Whether
the microemulsion is of the water-in-oil (w/o) or an oil-in-water (o/w) type is
dependent on the properties of the oil and surfactant used and on the structure and
geometric packing of the polar heads and hydrocarbon tails of the surfactant
molecules (Schott, in Remington's Pharmaceutical Sciences, 271 (Mack
Publishing Co., Easton, Pa., 1985)).

Surfactants used in the preparation of microemulsions include, but are not
limited to, ionic surfactants, non-ionic surfactants, Brij 96, polyoxyethylene oleyl
ethers, polyglycerol fatty acid esters, tetraglycerol monolaurate (ML310),
tetraglycerol monooleate (MO310), hexaglycerol monooleate (PO310), hexaglycerol
pentaoleate (PO500), decaglycerol monocaprate (MCA750), decaglycerol monooleate
(MO750), decaglycerol sesquioleate (SO750), decaglycerol decaoleate (DAO750),
alone or in combination with co-surfactants. The co-surfactant, usually a short-chain
alcohol such as ethanol, 1-propanol, or 1-butanol, serves to increase the interfacial
fluidity by penetrating into the surfactant film and consequently creating a disordered
film because of the void space generated among surfactant molecules.

Microemulsions may, however, be prepared without the use of co-surfactants,
and alcohol-free self-emulsifying microemulsion systems are known in the art. The
aqueous phase may typically be, but is not limited to, water, an aqueous solution of
the drug, glycerol, PEG300, PEG400, polyglycerols, propylene glycols, and
derivatives of ethylene glycol. The oil phase may include, but is not limited to,
materials such as Captex 300, Captex 355, Capmul MCM, fatty acid esters, medium
chain (C8-C12) mono-, di-, and tri-glycerides, polyoxyethylated glyceryl fatty acid
esters, fatty alcohols, polyglycolized glycerides, saturated polyglycolized C8-C10
glycerides, vegetable oils and silicone oil.

Microemulsions are particularly of interest from the standpoint of drug
solubilization and the enhanced absorption of drugs. Lipid based microemulsions
(both o/w and w/o) have been proposed to enhance the oral bioavailability of drugs,
including peptides (Constantinides et al., Pharm. Res., 1994, 11: 1385-90; Ritschel,
Meth. Find. Exp. Clin. Pharmacol., 1993, 13: 205). Microemulsions afford
advantages of improved drug solubilization, protection of drug from enzymatic
hydrolysis, possible enhancement of drug absorption due to surfactant-induced
alterations in membrane fluidity and permeability, ease of preparation, ease of oral
administration over solid dosage forms, improved clinical potency, and decreased
toxicity (Constantinides et al., 1994; Ho et al., J. Pharm. Sci., 1996, 85: 138-143).
Often microemulsions may form spontaneously when their components are brought
together at ambient temperature. This may be particularly advantageous when
formulating thermolabile drugs, peptides or oligonucleotides. Microemulsions have
also been effective in the transdermal delivery of active components in both cosmetic
and pharmaceutical applications. It is expected that the microemulsion compositions
and formulations of the present invention will facilitate the increased systemic
absorption of oligonucleotides and nucleic acids and other active agents from the
gastrointestinal tract, as well as improve the local cellular uptake of oligonucleotides
and nucleic acids and other active agents within the gastrointestinal tract, vagina,
buccal cavity and other areas of administration.

Microemulsions of the present invention may also contain additional
components and additives such as sorbitan monostearate (Grill 3), Labrasol, and
penetration enhancers to improve the properties of the formulation and to enhance the
absorption of the oligonucleotides and nucleic acids of the present invention.
Penetration enhancers used in the microemulsions of the present invention may be
classified as belonging to one of five broad categories: surfactants, fatty acids, bile
salts, chelating agents, and non-chelating non-surfactants (Lee et al., Crit. Rev.
Therap. Drug Carrier Systems, 1991, p. 92). Each of these classes has been discussed
above.

There are many organized surfactant structures besides microemulsions that
have been studied and used for the formulation of drugs. These include monolayers,
micelles, bilayers and vesicles. Vesicles, such as liposomes, are useful because of
their specificity and the duration of action. As used in the present invention, the term
"liposome" means a vesicle composed of amphiphilic lipids arranged in a spherical
bilayer or bilayers.

Liposomes are unilamellar or multilamellar vesicles which have a membrane
formed from a lipophilic material and an aqueous interior. The aqueous portion
contains the composition to be delivered. Cationic liposomes possess the advantage
of being able to fuse to the cell wall. Non-cationic liposomes, although not able to
fuse as efficiently with the cell wall, are taken up by macrophages in vivo. Selection
of the appropriate liposome depending on the agent to be encapsulated would be
evident given what is known in the art.

In order to cross mammalian skin, lipid vesicles must pass through a series of
fine pores, each with a diameter less than 50 nm, under the influence of a suitable
transdermal gradient. Therefore, it is desirable to use a liposome that is highly
deformable and able to pass through such fine pores.

Further advantages of liposomes include: (a) liposomes obtained from natural
phospholipids are biocompatible and biodegradable; (b) liposomes can incorporate a
wide range of water- and lipid-soluble drugs; and (c) liposomes can protect encapsulated
drugs in their internal compartments from metabolism and degradation (Rosoff, in
PHARMACEUTICAL DOSAGE FORMS). Important considerations in the preparation of
liposome formulations are the lipid surface charge, vesicle size and the aqueous
volume of the liposomes.

Liposomes are useful for the transfer and delivery of active ingredients to the
site of action. Because the liposomal membrane is structurally similar to biological
membranes, when liposomes are applied to a tissue, the liposomes start to merge with
the cellular membranes. As the merging of the liposome and cell progresses, the
liposomal contents are emptied into the cell where the active agent may act.

Another embodiment also contemplates the use of liposomes for topical
administration. The advantages of such administration include reduced side-effects
related to high systemic absorption of the administered drug, increased accumulation
of the administered drug at the desired target, and the ability to administer a wide
variety of drugs, both hydrophilic and hydrophobic, into the skin. Several reports have
detailed the ability of liposomes to deliver agents including high-molecular weight
DNA into the skin. Compounds including analgesics, antibodies, hormones and
high-molecular weight DNAs have been administered to the skin. The majority of
applications resulted in the targeting of the upper epidermis.

Liposomes fall into two broad classes. Cationic liposomes are positively
charged liposomes that interact with the negatively charged DNA molecules to form a
stable complex. The positively charged DNA/liposome complex binds to the
negatively charged cell surface and is internalized in an endosome. Due to the acidic
pH within the endosome, the liposomes are ruptured, releasing their contents into the
cell cytoplasm (Wang et al., Biochem. Biophys. Res. Comm., 1987, 147: 980-5).

Liposomes that are pH-sensitive or negatively charged entrap DNA rather
than complex with it. Since both the DNA and the lipid are similarly charged,
repulsion rather than complex formation occurs. Nevertheless, some DNA is
entrapped within the aqueous interior of these liposomes. pH-sensitive liposomes
have been used to deliver DNA encoding the thymidine kinase gene to cell
monolayers in culture. Expression of the exogenous gene was detected in the target
cells (Zhou et al., J. Controlled Release, 1992, 19: 269-74).

Another contemplated liposomal composition includes phospholipids other
than naturally-derived phosphatidylcholine. Neutral liposome compositions, for
example, can be formed from dimyristoyl phosphatidylcholine (DMPC) or
dipalmitoyl phosphatidylcholine (DPPC). Anionic liposome compositions generally
are formed from dimyristoyl phosphatidylglycerol, while anionic fusogenic liposomes
are formed primarily from dioleoyl phosphatidylethanolamine (DOPE). Another type
of liposomal composition is formed from phosphatidylcholine (PC) such as, for
example, soybean PC, and egg PC. Another type is formed from mixtures of
phospholipid and/or phosphatidylcholine and/or cholesterol.

Also contemplated are "sterically stabilized" liposomes, a term that refers to
liposomes comprising one or more specialized lipids that, when incorporated into
liposomes, result in enhanced circulation lifetimes relative to liposomes lacking such
specialized lipids. Examples of sterically stabilized liposomes are those in which part of
the vesicle-forming lipid portion of the liposome (A) comprises one or more
glycolipids, such as monosialoganglioside GM1, or (B) is derivatized with one or more
hydrophilic polymers, such as a polyethylene glycol (PEG) moiety. While not
wishing to be bound by any particular theory, it is thought in the art that, at least for
sterically stabilized liposomes containing gangliosides, sphingomyelin, or
PEG-derivatized lipids, the enhanced circulation half-life of these sterically stabilized
liposomes derives from a reduced uptake into cells of the reticuloendothelial system
(RES) (Allen et al., FEBS Lett., 1987, 223: 42; Wu et al., Cancer Res., 1993, 53: 3765).

Many liposomes comprising lipids derivatized with one or more hydrophilic
polymers, and methods of preparation thereof, are known in the art. For example,
Sunamoto et al. (Bull. Chem. Soc. Jpn., 1980, 53: 2778) described liposomes
comprising a nonionic detergent, 2C12 15G, that contains a PEG moiety. Illum et al.
(FEBS Lett., 1984, 167: 79) noted that hydrophilic coating of polystyrene particles
with polymeric glycols results in significantly enhanced blood half-lives. Synthetic
phospholipids modified by the attachment of carboxylic groups of polyalkylene
glycols (e.g., PEG) are described by Sears (U.S. Pat. Nos. 4,426,330 and 4,534,899).
Klibanov et al. (FEBS Lett., 1990, 268: 235) described experiments demonstrating
that liposomes comprising phosphatidylethanolamine (PE) derivatized with PEG or
PEG stearate have significant increases in blood circulation half-lives. Blume et al.
(Biochimica et Biophysica Acta, 1990, 1029: 91) extended such observations to other
PEG-derivatized phospholipids, e.g., DSPE-PEG, formed from the combination of
distearoylphosphatidylethanolamine (DSPE) and PEG. Liposomes having covalently
bound PEG moieties on their external surface are described in European Patent No.
EP 0 445 131 B1 and WO 90/04384 to Fisher. Liposome compositions containing
1-20 mole percent of PE derivatized with PEG, and methods of use thereof, are
described by, e.g., Woodle et al. (U.S. Pat. Nos. 5,013,556 and 5,356,633) and Martin
et al. (U.S. Pat. No. 5,213,804 and European Patent No. EP 0 496 813 B1).
Liposomes comprising a number of other lipid-polymer conjugates are disclosed in
WO 91/05545 and U.S. Pat. No. 5,225,212 (both to Martin et al.) and in WO
94/20073 (Zalipsky et al.). Liposomes comprising PEG-modified ceramide lipids are
described in WO 96/10391 (Choi et al.). U.S. Pat. No. 5,540,935 (Miyazaki et al.)
and U.S. Pat. No. 5,556,948 (Tagawa et al.) describe PEG-containing liposomes that
can be further derivatized with functional moieties on their surfaces.

Methods of encapsulating nucleic acids in liposomes are also known in the art.
WO 96/40062 to Thierry et al. discloses methods for encapsulating high
molecular weight nucleic acids in liposomes. U.S. Pat. No. 5,264,221 to Tagawa et
al. discloses protein-bonded liposomes and asserts that the contents of such liposomes
may include an antisense RNA. U.S. Pat. No. 5,665,710 to Rahman et al. describes
certain methods of encapsulating oligodeoxynucleotides in liposomes.

Surfactants find wide application in formulations such as emulsions (including
microemulsions) and liposomes. The most common way of classifying and ranking
the properties of the many different types of surfactants, both natural and synthetic, is
by the use of the hydrophile/lipophile balance (HLB). The nature of the hydrophilic
group (also known as the "head") provides the most useful means for categorizing the
different surfactants used in formulations (Rieger, in PHARMACEUTICAL DOSAGE
FORMS, Marcel Dekker, Inc., New York, N.Y., 1988, p. 285).

If the surfactant molecule is not ionized, it is classified as a nonionic
surfactant. Nonionic surfactants find wide application in pharmaceutical and
cosmetic products and are usable over a wide range of pH values. In general, their
HLB values range from 2 to about 18 depending on their structure. Nonionic
surfactants include nonionic esters such as ethylene glycol esters, propylene glycol
esters, glyceryl esters, polyglyceryl esters, sorbitan esters, sucrose esters, and
ethoxylated esters. Nonionic alkanolamides and ethers such as fatty alcohol
ethoxylates, propoxylated alcohols, and ethoxylated/propoxylated block polymers are
also included in this class. The polyoxyethylene surfactants are the most popular
members of the nonionic surfactant class.

If the surfactant molecule carries a negative charge when it is dissolved or
dispersed in water, the surfactant is classified as anionic. Anionic surfactants include
carboxylates such as soaps, acyl lactylates, acyl amides of amino acids, esters of
sulfuric acid such as alkyl sulfates and ethoxylated alkyl sulfates, sulfonates such as
alkyl benzene sulfonates, acyl isethionates, acyl taurates and sulfosuccinates, and
phosphates. The most important members of the anionic surfactant class are the alkyl
sulfates and the soaps.

If the surfactant molecule carries a positive charge when it is dissolved or
dispersed in water, the surfactant is classified as cationic. Cationic surfactants include
quaternary ammonium salts and ethoxylated amines. The quaternary ammonium salts
are the most used members of this class.

If the surfactant molecule has the ability to carry either a positive or negative
charge, the surfactant is classified as amphoteric. Amphoteric surfactants include
acrylic acid derivatives, substituted alkylamides, N-alkylbetaines and phosphatides.
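
The four head-group classes described above amount to a simple decision rule on the charge the molecule can carry in water. The following is only an illustrative sketch of that rule; the function name, the enum, and the two boolean inputs are hypothetical and are not part of the disclosure.

```python
from enum import Enum


class SurfactantClass(Enum):
    NONIONIC = "nonionic"
    ANIONIC = "anionic"
    CATIONIC = "cationic"
    AMPHOTERIC = "amphoteric"


def classify_surfactant(can_be_positive: bool, can_be_negative: bool) -> SurfactantClass:
    """Classify a surfactant by the charge its head group can carry in water.

    Hypothetical helper: the two flags stand in for whatever experimental or
    structural information indicates the possible charge states.
    """
    if can_be_positive and can_be_negative:
        # Able to carry either charge: amphoteric (e.g., N-alkylbetaines).
        return SurfactantClass.AMPHOTERIC
    if can_be_negative:
        # Negative when dissolved or dispersed: anionic (e.g., alkyl sulfates, soaps).
        return SurfactantClass.ANIONIC
    if can_be_positive:
        # Positive when dissolved or dispersed: cationic (e.g., quaternary ammonium salts).
        return SurfactantClass.CATIONIC
    # Not ionized: nonionic (e.g., sorbitan esters).
    return SurfactantClass.NONIONIC
```

For instance, `classify_surfactant(False, True)` yields the anionic class, matching the treatment of alkyl sulfates in the text.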

The use of surfactants in drug products, formulations and in emulsions has
been reviewed (Rieger, in PHARMACEUTICAL DOSAGE FORMS, Marcel Dekker,
Inc., New York, N.Y., 1988, p. 285).

In one embodiment, the present invention employs various penetration
enhancers to effect the efficient delivery of nucleic acids and other agents, particularly
oligonucleotides, to the skin of animals. Most drugs are present in solution in both
ionized and nonionized forms. However, usually only lipid soluble or lipophilic
drugs readily cross cell membranes. It has been discovered that even non-lipophilic
drugs may cross cell membranes if the membrane to be crossed is treated with a
penetration enhancer. In addition to aiding the diffusion of non-lipophilic drugs
across cell membranes, penetration enhancers also enhance the permeability of
lipophilic drugs.

Penetration enhancers may be classified as belonging to one of five broad
categories, i.e., surfactants, fatty acids, bile salts, chelating agents, and non-chelating
non-surfactants (Lee et al., Critical Reviews in Therapeutic Drug Carrier Systems,
1991, p. 92). Each of the above-mentioned classes of penetration enhancers is
described below in greater detail.

Another embodiment of the invention contemplates pharmaceutical
compositions comprising surfactants. Surfactants (or "surface-active agents") are
chemical entities which, when dissolved in an aqueous solution, reduce the surface
tension of the solution or the interfacial tension between the aqueous solution and
another liquid, with the result that absorption of oligonucleotides through the mucosa
is enhanced. In addition to bile salts and fatty acids, these penetration enhancers
include, for example, sodium lauryl sulfate, polyoxyethylene-9-lauryl ether and
polyoxyethylene-20-cetyl ether (Lee et al., Crit. Rev. Therap. Drug Carrier Systems,
1991, p. 92); and perfluorochemical emulsions, such as FC-43 (Takahashi et al., J.
Pharm. Pharmacol., 1988, 40: 252).

Another embodiment contemplates the use of various fatty acids and their
derivatives as penetration enhancers. These include, for example, oleic acid, lauric
acid, capric acid (n-decanoic acid), myristic acid, palmitic acid, stearic acid, linoleic
acid, linolenic acid, dicaprate, tricaprate, monoolein (1-monooleoyl-rac-glycerol),
dilaurin, caprylic acid, arachidonic acid, glycerol 1-monocaprate,
1-dodecylazacycloheptan-2-one, acylcarnitines, acylcholines, C1-10 alkyl esters thereof
(e.g., methyl, isopropyl and t-butyl), and mono- and di-glycerides thereof (i.e., oleate,
laurate, caprate, myristate, palmitate, stearate, linoleate, and the like) (Lee et al.,
1991; Muranishi, Crit. Rev. Therap. Drug Carrier Systems, 1990, 7: 1-33; El Hariri et
al., J. Pharm. Pharmacol., 1992, 44: 651-4).

The compositions comprising the active agents of the invention may further
comprise bile salts. The physiological role of bile includes the facilitation of
dispersion and absorption of lipids and fat-soluble vitamins (Brunton, Chapter 38 in:
GOODMAN & GILMAN'S THE PHARMACOLOGICAL BASIS OF THERAPEUTICS, 9th Ed.,
Hardman et al. Eds., McGraw-Hill, N.Y., 1996, pp. 934-935). Various natural bile
salts, and their synthetic derivatives, act as penetration enhancers. Thus, the term
"bile salts" includes any of the naturally occurring components of bile as well as any
of their synthetic derivatives. The bile salts of the invention include, for example,
cholic acid (or its pharmaceutically acceptable sodium salt, sodium cholate),
dehydrocholic acid (sodium dehydrocholate), deoxycholic acid (sodium
deoxycholate), glucholic acid (sodium glucholate), glycocholic acid (sodium
glycocholate), glycodeoxycholic acid (sodium glycodeoxycholate), taurocholic acid
(sodium taurocholate), taurodeoxycholic acid (sodium taurodeoxycholate),
chenodeoxycholic acid (sodium chenodeoxycholate), ursodeoxycholic acid (UDCA),
sodium tauro-24,25-dihydro-fusidate (STDHF), sodium glycodihydrofusidate and
polyoxyethylene-9-lauryl ether (POE) (Lee et al., 1991; Swinyard, Chapter 39 in:
REMINGTON'S PHARMACEUTICAL SCIENCES, 18th Ed., Gennaro, ed., Mack Publishing
Co., Easton, Pa., 1990, pages 782-783; Muranishi, 1990; Yamamoto et al., J. Pharm.
Exp. Ther., 1992, 263: 25; Yamashita et al., J. Pharm. Sci., 1990, 79: 579-83).

The invention further contemplates compositions comprising chelating agents.
Chelating agents can be defined as compounds that remove metallic ions from
solution by forming complexes therewith, with the result that absorption of
oligonucleotides through the mucosa is enhanced. With regard to their use as
penetration enhancers when the active agent is an antisense agent, chelating
agents have the added advantage of also serving as DNase inhibitors, as most
characterized DNA nucleases require a divalent metal ion for catalysis and are thus
inhibited by chelating agents (Jarrett, J. Chromatogr., 1993, 618: 315-39). Chelating
agents of the invention include but are not limited to disodium
ethylenediaminetetraacetate (EDTA), citric acid, salicylates (e.g., sodium salicylate,
5-methoxysalicylate and homovanillate), N-acyl derivatives of collagen, laureth-9 and
N-amino acyl derivatives of beta-diketones (enamines) (Lee et al., 1991; Muranishi,
1990; Buur et al., J. Control Rel., 1990, 14: 43-51).

The invention also contemplates pharmaceutical compositions comprising
active agents and non-chelating non-surfactants. Non-chelating non-surfactant
penetration enhancing compounds can be defined as compounds that demonstrate
insignificant activity as chelating agents or as surfactants, but that nonetheless
enhance absorption of oligonucleotides through the alimentary mucosa (Muranishi,
1990). This class of penetration enhancers includes, for example, unsaturated cyclic
ureas, 1-alkyl- and 1-alkenylazacyclo-alkanone derivatives (Lee et al., 1991); and
non-steroidal anti-inflammatory agents such as diclofenac sodium, indomethacin and
phenylbutazone (Yamashita et al., J. Pharm. Pharmacol., 1987, 39: 621-6).

For pharmaceutical compositions comprising oligonucleotides, agents that
enhance uptake of oligonucleotides at the cellular level may also be added to the
pharmaceutical and other compositions of the present invention. For example,
cationic lipids, such as lipofectin (Junichi et al., U.S. Pat. No. 5,705,188), cationic
glycerol derivatives, and polycationic molecules, such as polylysine (Lollo et al., PCT
Application WO 97/30731), are also known to enhance the cellular uptake of
oligonucleotides.

Other agents may be utilized to enhance the penetration of the administered
nucleic acids, including glycols such as ethylene glycol and propylene glycol, pyrrols
such as 2-pyrrol, azones, and terpenes such as limonene and menthone.

Certain compositions of the present invention also incorporate carrier
compounds in the formulation. As used herein, "carrier compound" or "carrier" can
refer to a nucleic acid, or analog thereof, which is inert (i.e., does not possess
biological activity per se) but is recognized as a nucleic acid by in vivo processes that
reduce the bioavailability of a nucleic acid having biological activity by, for example,
degrading the biologically active nucleic acid or promoting its removal from
circulation. The coadministration of a nucleic acid and a carrier compound, typically
with an excess of the latter substance, can result in a substantial reduction of the
amount of nucleic acid recovered in the liver, kidney or other extracirculatory
reservoirs, presumably due to competition between the carrier compound and the
nucleic acid for a common receptor. For example, the recovery of a partially
phosphorothioate oligonucleotide in hepatic tissue can be reduced when it is
coadministered with polyinosinic acid, dextran sulfate, polycytidic acid or
4-acetamido-4'-isothiocyano-stilbene-2,2'-disulfonic acid (Miyao et al., Antisense Res.
Dev., 1995, 5: 115-121; Takakura et al., Antisense & Nucl. Acid Drug Dev., 1996, 6:
177-183).

The pharmaceutical compositions disclosed herein may also comprise one or
more pharmaceutically acceptable excipients. In contrast to carrier compounds
described above, these excipients include a pharmaceutically acceptable solvent,
suspending agent or any other pharmacologically inert vehicle for delivering one or
more nucleic acids or other active agents to an animal. The excipient may be liquid
or solid and is selected, with the planned manner of administration in mind, so as to
provide for the desired bulk, consistency, etc., when combined with a nucleic acid or
other active agent and the other components of a given pharmaceutical composition.
Typical pharmaceutical carriers include, but are not limited to, binding agents (e.g.,
pregelatinized maize starch, polyvinylpyrrolidone or hydroxypropyl methylcellulose,
etc.); fillers (e.g., lactose and other sugars, microcrystalline cellulose, pectin, gelatin,
calcium sulfate, ethyl cellulose, polyacrylates or calcium hydrogen phosphate, etc.);
lubricants (e.g., magnesium stearate, talc, silica, colloidal silicon dioxide, stearic acid,
metallic stearates, hydrogenated vegetable oils, corn starch, polyethylene glycols,
sodium benzoate, sodium acetate, etc.); disintegrants (e.g., starch, sodium starch
glycolate, etc.); and wetting agents (e.g., sodium lauryl sulphate, etc.).

Pharmaceutically acceptable organic or inorganic excipients suitable for
non-parenteral administration, which do not deleteriously react with nucleic acids, can
also be used to formulate the compositions of the present invention. Suitable
pharmaceutically acceptable carriers include, but are not limited to, water, salt
solutions, alcohols, polyethylene glycols, gelatin, lactose, amylose, magnesium
stearate, talc, silicic acid, viscous paraffin, hydroxymethylcellulose,
polyvinylpyrrolidone and the like.

Formulations for topical administration of nucleic acids and other 
contemplated active agents may include sterile and non-sterile aqueous solutions, 
non-aqueous solutions in common solvents such as alcohols, or solutions of the 
nucleic acids in liquid or solid oil bases. The solutions may also contain buffers, 
diluents and other suitable additives. Pharmaceutically acceptable organic or 
inorganic excipients suitable for non-parenteral administration that do not 
deleteriously react with nucleic acids or other contemplated active agents can be used. 

Suitable pharmaceutically acceptable excipients include, but are not limited to, 
water, salt solutions, alcohol, polyethylene glycols, gelatin, lactose, amylose, 
magnesium stearate, talc, silicic acid, viscous paraffin, hydroxymethylcellulose, 
polyvinylpyrrolidone and the like. 

The compositions of the present invention may additionally contain other 
adjunct components conventionally found in pharmaceutical compositions, at their 
art-established usage levels. Thus, for example, the compositions may contain 
additional, compatible, pharmaceutically-active materials such as, e.g., antipruritics, 
astringents, local anesthetics or anti-inflammatory agents, or may contain additional 
materials useful in physically formulating various dosage forms of the compositions 
of the present invention, such as dyes, flavoring agents, preservatives, antioxidants, 
opacifiers, thickening agents and stabilizers. However, such materials, when added, 
should not unduly interfere with the biological activities of the components of the 
compositions of the present invention. The formulations can be sterilized and, if 
desired, mixed with auxiliary agents, e.g., lubricants, preservatives, stabilizers, 
wetting agents, emulsifiers, salts for influencing osmotic pressure, buffers, colorings, 
flavorings and/or aromatic substances and the like which do not deleteriously interact 
with the nucleic acid(s) of the formulation. 

Aqueous suspensions may contain substances that increase the viscosity of the 
suspension including, for example, sodium carboxymethylcellulose, sorbitol and/or 
dextran. The suspension may also contain stabilizers. 

Certain embodiments of the invention provide pharmaceutical compositions
containing (a) one or more antisense compounds, and (b) one or more other
chemotherapeutic agents which function by a non-antisense mechanism. Examples of
such chemotherapeutic agents include, but are not limited to, anticancer drugs such as
daunorubicin, dactinomycin, doxorubicin, bleomycin, mitomycin, nitrogen mustard,
chlorambucil, melphalan, cyclophosphamide, 6-mercaptopurine, 6-thioguanine,
cytarabine (CA), 5-fluorouracil (5-FU), floxuridine (5-FUdR), methotrexate (MTX),
colchicine, vincristine, vinblastine, etoposide, teniposide, cisplatin and
diethylstilbestrol (DES). See, generally, THE MERCK MANUAL OF DIAGNOSIS AND
THERAPY, 1206-28 (15th Ed., Berkow et al., eds., 1987, Rahway, N.J.).
Anti-inflammatory drugs, including but not limited to nonsteroidal anti-inflammatory
drugs and corticosteroids, and antiviral drugs, including but not limited to ribavirin,
vidarabine, acyclovir and ganciclovir, may also be combined in compositions of the
invention. See, generally, THE MERCK MANUAL OF DIAGNOSIS AND THERAPY,
2499-2506 and 46-49 (15th Ed., Berkow et al., eds., 1987, Rahway, N.J.) respectively.
Other non-antisense chemotherapeutic agents are also within the scope of this
invention. Two or more combined compounds may be used together or sequentially.

In another related embodiment, compositions of the invention may contain
one or more antisense compounds or other active agents. Two or more combined
compounds may be used together or sequentially.

The formulation of therapeutic compositions and their subsequent 
administration is believed to be within the skill of those in the art. Dosing is 
dependent on severity and responsiveness of the disease state to be treated, with the 
course of treatment lasting from several days to several months, or until a cure is 
effected or a diminution of the disease state is achieved. Optimal dosing schedules 
can be calculated from measurements of drug accumulation in the body of the patient. 
Persons of ordinary skill can easily determine optimum dosages, dosing
methodologies and repetition rates. Optimum dosages may vary depending on the
relative potency of individual oligonucleotides, and can generally be estimated based
on EC50s found to be effective in in vitro and in vivo animal models. In general, dosage
is from 0.01 µg to 100 g per kg of body weight, and may be given once or more daily,
weekly, monthly or yearly, or even once every 2 to 20 years. Persons of ordinary
skill in the art can easily estimate repetition rates for dosing based on measured
residence times and concentrations of the drug in bodily fluids or tissues. Following
successful treatment, it may be desirable to have the patient undergo maintenance
therapy to prevent the recurrence of the disease state, wherein the oligonucleotide is
administered in maintenance doses, ranging from 0.01 µg to 100 g per kg of body
weight, once or more daily, to once every 20 years.

VI. Polypeptide and Peptides 

The polypeptides or peptides of the invention are isolated polypeptides or
peptides. Preferably these polypeptides are encoded by the smORF identified by the
in silico process, but they can also be prepared synthetically or expressed from a
recombinant nucleic acid that encodes the same protein but, owing to the degeneracy
of the genetic code, differs in sequence from the smORF identified in silico.
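
As an illustration of code degeneracy, the sketch below translates two different DNA sequences with the standard genetic code and shows that they specify the same peptide. It is a minimal example only: the two short sequences are made up for the purpose and are not smORFs from this disclosure, and the function name `translate` is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence from its first base up to the first stop codon."""
    dna = dna.upper()
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Two hypothetical coding sequences that differ at several positions
# yet encode the same peptide because of synonymous codons.
seq_a = "ATGGCTTTGTAA"
seq_b = "ATGGCCCTCTGA"
same_peptide = translate(seq_a) == translate(seq_b)
```

Here `seq_a` and `seq_b` differ in their alanine, leucine, and stop codons, yet both translate to the same three-residue peptide, so `same_peptide` is true.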

As used herein, with respect to peptides, the terms "isolated peptides,"
"isolated polypeptides" and "isolated protein" mean that the compounds are
substantially pure and are essentially free of other substances with which they may be
found in nature or in vivo systems to an extent practical and appropriate for their
intended use. In particular, the compounds are sufficiently pure and are sufficiently
free from other biological constituents of their hosts' cells so as to be useful in, for
example, producing pharmaceutical preparations or sequencing. Because an isolated
peptide (which as used herein also includes polypeptides and proteins) of the
invention may be admixed with a pharmaceutically acceptable carrier in a
pharmaceutical preparation, the peptide may comprise only a small percentage by
weight of the preparation. The peptide is nonetheless substantially pure in that it has
been substantially separated from the substances with which it may be associated in
living systems.

The polypeptides and proteins of the invention can be used to prepare
antibodies, to identify ligand binding partners, in competition assays, and the like as
would be known in the art. These assays using fragments of the proteins may be
based on motifs identified in the polypeptides, such as the representative examples
shown in Table 3 (Motifs).

VII. Antibodies, Antibody Fragments and Immunologically Active Immunogens

The invention also contemplates preparation and use of immunoglobulins
against the proteins encoded by the smORFs. The term "immunoglobulins" is meant to
include antibodies, antibody fragments (e.g., Fab, Fab', Fv, scFv, and F(ab')2),
bispecific antibodies, polyclonal and monoclonal antibodies, human and humanized
antibodies, bivalent antibodies and antibody fragments and the like.

A. Humanized and Primatized® Antibodies

The invention further provides humanized immunoglobulins (or antibodies).
The humanized antibodies are preferably specific to the protein encoded by a specific
smORF. These humanized and primatized® antibodies are useful as therapeutic and
diagnostic reagents in their own right or can be combined to form a humanized or
primatized® bispecific antibody possessing both of the binding specificities of its
components.

The humanized and primatized® forms of immunoglobulins have variable
framework region(s) substantially from a human immunoglobulin (termed an acceptor
immunoglobulin) and complementarity determining regions substantially from a
mouse immunoglobulin (referred to as the donor immunoglobulin). The constant
region(s), if present, are also substantially from a human immunoglobulin. The
humanized antibodies exhibit a specific binding affinity for their respective antigens
of at least 10^7, 10^8, 10^9, or 10^10 M^-1. Often the upper and lower limits of binding
affinity of the humanized antibodies are within a factor of three or five or ten of that
of the mouse (or other animal) antibody from which they were derived.
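
The two affinity criteria above (a minimum association constant, and agreement with the parent antibody to within a stated factor) can be expressed as simple numerical checks. The sketch below is illustrative only; the function names and the example affinity values are hypothetical and are not measurements from this disclosure.

```python
def meets_minimum_affinity(ka_per_molar: float, threshold_per_molar: float = 1e7) -> bool:
    """Return True if the association constant (M^-1) is at least the threshold."""
    return ka_per_molar >= threshold_per_molar


def within_fold(humanized_ka: float, parent_ka: float, fold: float) -> bool:
    """Return True if the humanized affinity lies within `fold` of the parent's.

    "Within a factor of N" is read here as parent/N <= humanized <= parent*N.
    """
    if humanized_ka <= 0 or parent_ka <= 0 or fold < 1:
        raise ValueError("affinities must be positive and fold must be >= 1")
    ratio = humanized_ka / parent_ka
    return 1.0 / fold <= ratio <= fold


# Hypothetical values: a parent mouse antibody and two humanized variants.
parent = 1e9       # M^-1
variant_a = 4e8    # 0.4x the parent affinity
variant_b = 1.5e8  # 0.15x the parent affinity

a_within_three = within_fold(variant_a, parent, 3)
b_within_five = within_fold(variant_b, parent, 5)
b_within_ten = within_fold(variant_b, parent, 10)
```

With these made-up numbers, the first variant falls within a factor of three of the parent, while the second falls outside a factor of five but inside a factor of ten.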

A "humanized monoclonal antibody" as used herein is a human monoclonal 
antibody or functionally active fragment thereof having human constant regions and a 
region that binds to a protein encoded by a smORF, wherein that region is from a 
mammal of a species other than a human. Humanized monoclonal antibodies may be 
made by any method known in the art. A "primatized® monoclonal antibody" would 
be one having a domain from a primate, such as a cynomolgus macaque. For
example, see Anderson et al., 1997, Clin. Immunol. Immunopathol. 84: 73-84 and U.S.
Patent Nos. 6,001,358 and 6,113,898.

Humanized monoclonal antibodies, for example, may be constructed by 
replacing the non-CDR regions of a non-human mammalian antibody with similar 
regions of human antibodies while retaining the epitopic specificity of the original 
antibody. For example, non-human CDRs and optionally some of the framework 
regions may be covalently joined to human FR and/or Fc/pFc' regions to produce a 
functional antibody. Certain corporations are now humanizing antibodies from
specific murine antibody regions, e.g., Protein Design Labs (Mountain View, Calif.).

European Patent Application 0 239 400 provides an exemplary teaching of the 
production and use of humanized monoclonal antibodies in which at least the 
complementarity determining regions (CDR) portion of a murine (or other non- 
human mammal) antibody is included in the humanized antibody. Briefly, the 
following methods are useful for constructing a humanized CDR monoclonal 
antibody including at least a portion of a mouse CDR. A first replicable expression 
vector including a suitable promoter operably linked to a DNA sequence encoding at 
least a variable domain of an Ig heavy or light chain and the variable domain 
comprising framework regions from a human antibody and a CDR region of a murine 
antibody is prepared. Optionally a second replicable expression vector is prepared 
which includes a suitable promoter operably linked to a DNA sequence encoding at 
least the variable domain of a complementary human Ig light or heavy chain 
respectively. A cell line is then transformed with the vectors. Preferably the cell line 
is an immortalized mammalian cell line of lymphoid origin, such as a myeloma cell 
line, or is a normal lymphoid cell that has been immortalized by transformation with a 
virus. The transformed cell line is then cultured under conditions known to those of 
skill in the art to produce the humanized antibody. 

As set forth in European Patent Application 0 239 400, several techniques are
well known in the art for creating the particular antibody domains to be inserted into
the replicable vector. For example, the DNA sequence encoding the domain may be
prepared by oligonucleotide synthesis. Alternatively, a synthetic gene lacking the
CDR regions may be prepared in which four framework regions are fused together
with suitable restriction sites at the junctions, such that double stranded synthetic or
restricted subcloned CDR cassettes with sticky ends can be ligated at the junctions of
the framework regions. Another method involves the preparation of the DNA sequence
encoding the variable CDR-containing domain by oligonucleotide site-directed
mutagenesis. Each of these methods is well known in the art. Therefore, those
skilled in the art may construct humanized antibodies containing a murine CDR
region without destroying the specificity of the antibody for its epitope.

As noted above, such humanized antibodies may be produced in which some 
or all of the FR regions of deposited monoclonal antibody have been replaced by 
homologous human FR regions. In addition, the Fc portions may be replaced so as to 
produce IgA or IgM as well as human IgG antibodies bearing some or all of the CDRs 
of the deposited monoclonal antibody. In a more preferred embodiment, a murine 
CDR is grafted into the framework region of a human antibody to prepare the 
"humanized antibody." See, e.g., L. Riechmann et al., 1988, Nature 332: 323; M. S. 
Neuberger et al., 1985, Nature 314: 268; and EPA 0 239 400 (published Sep. 30, 
1987). 

In one embodiment of the invention, the peptide containing a region that binds 
to a polypeptide encoded by a smORF is a functionally active antibody fragment. 
Significantly, as is well known in the art, only a small portion of an antibody 
molecule, the paratope, is involved in the binding of the antibody to its epitope (see, 
in general, Clark, W. R. (1986) THE EXPERIMENTAL FOUNDATIONS OF MODERN 
IMMUNOLOGY, Wiley & Sons, Inc., New York; Roitt, I. (1991) ESSENTIAL 
IMMUNOLOGY, 7th Ed., Blackwell Scientific Publications, Oxford). The pFc' and Fc 
regions of the antibody, for example, are effectors of the complement cascade but are 
not involved in antigen binding. An antibody from which the pFc' region has been 
enzymatically cleaved, or which has been produced without the pFc' region, 
designated an F(ab')2 fragment, retains both of the antigen binding sites of an intact 
antibody. An isolated F(ab')2 fragment is referred to as a bivalent monoclonal 
fragment because of its two antigen binding sites. Similarly, an antibody from which 
the Fc region has been enzymatically cleaved, or which has been produced without 
the Fc region, designated a Fab fragment, retains one of the antigen binding sites of 
an intact antibody molecule. Proceeding further, Fab fragments consist of a covalently 
bound antibody light chain and a portion of the antibody heavy chain denoted Fd 
(heavy chain variable region). The Fd fragments are the major determinant of 
antibody specificity (a single Fd fragment may be associated with up to ten different 
light chains without altering antibody specificity) and Fd fragments retain epitope- 
binding ability in isolation. Another preferred fragment is the scFv fragment. 

(i) Mouse Antibodies for Humanization. The starting material for production 
of a specific humanized antibody could be a protein, or an immunologically active 
portion thereof, encoded by SEQ ID NOS: 674-1346 or polypeptides identified by the 
disclosed in silico methods. 

(ii) Selection of Human Antibodies to Supply Framework Residues. The 
substitution of mouse CDRs into a human variable domain framework is most likely 
to result in retention of their correct spatial orientation if the human variable domain 
framework adopts the same or similar conformation to the mouse variable framework 
from which the CDRs originated. This is achieved by obtaining the human variable 
domains from human antibodies whose framework sequences exhibit a high degree of 
sequence identity with the murine variable framework domains from which the CDRs 
were derived. The heavy and light chain variable framework regions can be derived 
from the same or different human antibody sequences. The human antibody 
sequences can be the sequences of naturally occurring human antibodies or can be 
consensus sequences of several human antibodies. 

Suitable human antibody sequences are identified by computer comparisons of 
the amino acid sequences of the mouse variable regions with the sequences of known 
human antibodies. The comparison is performed separately for heavy and light chains 
but the principles are similar for each. 
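
The comparison described above can be illustrated with a minimal sketch. It assumes the mouse and human framework sequences have already been aligned to equal length (with "-" marking gaps); the function names and the toy sequences are hypothetical and are not taken from the disclosure. Heavy and light chains would be ranked separately, as noted above.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over the ungapped positions of two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    # Skip any column in which either sequence has a gap.
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def rank_human_frameworks(mouse_fr: str, human_frs: dict) -> list:
    """Return (name, percent identity) pairs, most similar human framework first."""
    scored = [(name, percent_identity(mouse_fr, seq)) for name, seq in human_frs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, hypothetical framework fragments for illustration only.
mouse_fr = "EVQLVESGGG"
candidates = {"human_A": "EVQLVESGGG", "human_B": "QVQLVQSGAE"}
ranking = rank_human_frameworks(mouse_fr, candidates)
```

In this sketch the highest-ranked entry would be the candidate acceptor framework; a real comparison would use full-length variable regions and an alignment program.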

(iii) Computer Modeling. The unnatural juxtaposition of murine (or other 
animal) CDR regions with a human variable framework region can result in unnatural 
conformational restraints, which, unless corrected by substitution of certain amino 
acid residues, lead to loss of binding affinity. The selection of amino acid residues 
for substitution is determined, in part, by computer modeling. Computer hardware 
and software for producing three-dimensional images of immunoglobulin molecules 
are widely available. In general, molecular models are produced starting from solved 
structures for immunoglobulin chains or domains thereof. The chains to be modeled 
are compared for amino acid sequence similarity with chains or domains of solved 
three-dimensional structures, and the chains or domains showing the greatest 
sequence similarity is/are selected as starting points for construction of the molecular 
model. The solved starting structures are modified to allow for differences between 
the actual amino acids in the immunoglobulin chains or domains being modeled, and 
those in the starting structure. The modified structures are then assembled into a 
composite immunoglobulin. Finally, the model is refined by energy minimization 
and by verifying that all atoms are within appropriate distances from one another and 
that bond lengths and angles are within chemically acceptable limits. 

Computer modeling can also be utilized to identify the portions of a protein 
encoded by a smORF that have a good antigenic profile or hydrophobicity profile. 
This can be performed using the algorithms of Chou-Fasman and the GOR method 
(Chou et al., 1978, Adv. Enzymol. Relat. Areas Mol. Biol. 47: 45-147; and Garnier et 
al., 1978, J. Mol. Biol. 120: 97-120). The proteins can also be analyzed using various 
available computer algorithms to determine whether the potential antigenic region is 
buried within the protein or is exposed at the surface of the protein. See, e.g., David 
W. Mount, BIOINFORMATICS: SEQUENCE AND GENOME ANALYSIS 381-478 (Cold 
Spring Harbor Laboratory Press, 2001). Alternatively, the antibodies and fragments 
thereof can be prepared to bind to domains identified by protein modeling, such as 
those of Table 3 (Motifs). 
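
As one illustration of a hydrophobicity profile, the sketch below computes a sliding-window average using the Kyte-Doolittle hydropathy scale. This scale is an assumption introduced here for illustration; it is not the Chou-Fasman or GOR method cited above, and the window length and function names are hypothetical. Strongly negative windows are more hydrophilic and therefore more likely to be surface exposed.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 7) -> list:
    """Mean hydropathy of every window of the given length along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def most_hydrophilic_window(sequence: str, window: int = 7) -> tuple:
    """Return (start index, mean hydropathy) of the most hydrophilic window."""
    profile = hydropathy_profile(sequence, window)
    start = min(range(len(profile)), key=profile.__getitem__)
    return start, profile[start]
```

A candidate region found this way would still need to be checked for surface exposure and antigenicity by the other methods described above.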

(iv) Substitution of Amino Acid Residues. As noted supra, the humanized 
antibodies of the invention comprise variable framework region(s) substantially from 
a human immunoglobulin and complementarity determining regions substantially 
from a mouse immunoglobulin. Having identified the complementarity determining 
regions of mouse antibodies and appropriate human acceptor immunoglobulins, the 
next step is to determine which, if any, residues from these components should be 
substituted to optimize the properties of the resulting humanized antibody. In 
general, substitution of human amino acid residues with murine should be minimized, 
because introduction of murine residues increases the risk of the antibody eliciting a 
human anti-murine antibody (HAMA) response in humans. Amino acids are selected 
for substitution based on their possible influence on CDR conformation and/or 
binding to antigen. Investigation of such possible influences is by modeling, 
examination of the characteristics of the amino acids at particular locations, or 
empirical observation of the effects of substitution or mutagenesis of particular amino 
acids. 

When an amino acid differs between a mouse variable framework region and 
an equivalent human variable framework region, the human framework amino acid 
should usually be substituted by the equivalent mouse amino acid if it is reasonably 
expected that the amino acid: 

(1) noncovalently contacts antigen directly, or 

(2) is adjacent to a CDR region or otherwise interacts with a CDR 
region (e.g., is within about 4-6 Å of a CDR region). 
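
The distance criterion in (2) can be sketched as follows, assuming atomic coordinates (in ångströms) are available from a molecular model. The data layout, residue labels, and the default 6 Å cutoff are illustrative assumptions, not part of the disclosure.

```python
import math


def residues_near_cdr(framework: dict, cdr_atoms: list, cutoff: float = 6.0) -> list:
    """List framework residues having any atom within `cutoff` Å of any CDR atom.

    framework maps a residue label to a list of (x, y, z) atom coordinates;
    cdr_atoms is a flat list of (x, y, z) coordinates for all CDR atoms.
    """
    near = []
    for residue, atoms in framework.items():
        if any(math.dist(atom, cdr) <= cutoff for atom in atoms for cdr in cdr_atoms):
            near.append(residue)
    return near


# Toy coordinates for illustration only (hypothetical residue labels).
framework = {"H71": [(0.0, 0.0, 0.0)], "H10": [(20.0, 0.0, 0.0)]}
cdr_atoms = [(3.0, 4.0, 0.0)]
candidates = residues_near_cdr(framework, cdr_atoms)
```

Residues flagged this way are only candidates for substitution; whether a given substitution is made still depends on the modeling and empirical considerations described in this section.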

Other candidates for substitution are acceptor human framework amino acids that are 
unusual for a human immunoglobulin at that position. These amino acids can be 
substituted with amino acids from the equivalent position of more typical human 
immunoglobulins. Alternatively, amino acids from equivalent positions in the mouse 
antibody can be introduced into the human framework regions when such amino acids 
are typical of human immunoglobulin at the equivalent positions. 

In general, substitution of all or most of the amino acids fulfilling the above 
criteria is desirable. Occasionally, however, there is some ambiguity about whether a 
particular amino acid meets the above criteria, and alternative variant 
immunoglobulins are produced, one of which has that particular substitution, the 
other of which does not. 

Usually the CDR regions in humanized antibodies are substantially identical, 
and more usually, identical to the corresponding CDR regions in the mouse antibody 
from which they were derived. Although not usually desirable, it is sometimes 
possible to make one or more conservative amino acid substitutions of CDR residues 
without appreciably affecting the binding affinity of the resulting humanized 
immunoglobulin. Occasionally, substitutions of CDR regions can enhance binding 
affinity. 

Other than for the specific amino acid substitutions discussed above, the 
framework regions of humanized immunoglobulins are usually substantially identical, 
and more usually, identical to the framework regions of the human antibodies from 
which they were derived. Of course, many of the amino acids in the framework 
region make little or no direct contribution to the specificity or affinity of an antibody. 
Thus, many individual conservative substitutions of framework residues can be 
tolerated without appreciable change of the specificity or affinity of the resulting 
humanized immunoglobulin. 

(v) Production of Variable Regions. Having conceptually selected the CDR 
and framework components of humanized immunoglobulins, a variety of methods are 
available for producing such immunoglobulins. Because of the degeneracy of the 
genetic code, a variety of nucleic acid sequences will encode each immunoglobulin 
amino acid sequence. The desired nucleic acid sequences can be produced by de novo 
solid-phase DNA synthesis or by PCR mutagenesis of an earlier prepared variant of the 
desired polynucleotide. All nucleic acids encoding the antibodies described in this 
application are expressly included in the invention. 
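
To give a sense of the scale of this degeneracy, the sketch below counts how many distinct coding sequences specify a given peptide under the standard genetic code. The table lists the number of sense codons per amino acid; the function name is hypothetical, and stop codons are not counted.

```python
from math import prod

# Number of sense codons per amino acid in the standard genetic code (61 total).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encoding_sequences(peptide: str) -> int:
    """Number of distinct nucleotide sequences that encode the peptide."""
    return prod(CODONS_PER_RESIDUE[residue] for residue in peptide.upper())
```

For example, a dipeptide of leucine and arginine has 6 x 6 = 36 possible coding sequences, whereas methionine-tryptophan has exactly one; for a full variable region the count is astronomically large.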

(vi) Selection of Constant Region. The variable segments of humanized 
antibodies produced as described supra are typically linked to at least a portion of an 
immunoglobulin constant region (Fc), typically that of a human immunoglobulin. 
Human constant region DNA sequences can be isolated in accordance with 
well-known procedures from a variety of human cells, but preferably immortalized B-cells 
(see, e.g., WO87/02671). Ordinarily, the antibody will contain both light chain and 
heavy chain constant regions. The heavy chain constant region usually includes CH1, 
hinge, CH2, CH3, and, sometimes, CH4 regions. 

The humanized antibodies include antibodies having all types of constant 
regions, including IgM, IgG, IgD, IgA and IgE, and any isotype, including IgG1, 
IgG2, IgG3 and IgG4. When it is desired that the humanized antibody exhibit 
cytotoxic activity, the constant domain is usually a complement-fixing constant 
domain and the class is typically IgG1. When such cytotoxic activity is not desirable, 
the constant domain may be of the IgG2 class. The humanized antibody may 
comprise sequences from more than one class or isotype. 

(vii) Expression Systems. Nucleic acids encoding humanized light and heavy 
chain variable regions, optionally linked to constant regions, are inserted into 
expression vectors. The light and heavy chains can be cloned in the same or different 
expression vectors. The DNA segments encoding immunoglobulin chains are 
operably linked to control sequences in the expression vector(s) that ensure the 
expression of immunoglobulin polypeptides. Such control sequences include a signal 
sequence, a promoter, an enhancer, and a transcription termination sequence (see 
Queen et al., 1989, Proc. Natl. Acad. Sci. USA 86: 10029; WO 90/07861; Co et al., 
1992, J. Immunol. 148: 1149). 

B. Fragments of Humanized Antibodies 

The humanized antibodies of the invention include fragments as well as intact 
antibodies. Typically, these fragments compete with the intact antibody from which 
they were derived for antigen binding. The fragments typically bind with an affinity 
of at least 10⁷ M⁻¹, and more typically 10⁸ or 10⁹ M⁻¹ (i.e., within the same ranges as 
the intact antibody). Humanized antibody fragments include separate heavy chains, 
light chains, Fab, Fab', F(ab')2, Fv, and scFv. Fragments are produced by recombinant 
DNA techniques, or by enzymatic or chemical separation of intact immunoglobulins. 
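
The affinities above are association constants (units of M⁻¹). Binding data are often reported instead as dissociation constants, which are simply the reciprocal. The short sketch below converts between the two; the function names are hypothetical.

```python
def kd_from_ka(ka_per_molar: float) -> float:
    """Dissociation constant (M) from an association constant (1/M)."""
    if ka_per_molar <= 0:
        raise ValueError("association constant must be positive")
    return 1.0 / ka_per_molar


def kd_in_nanomolar(ka_per_molar: float) -> float:
    """Dissociation constant expressed in nM."""
    return kd_from_ka(ka_per_molar) * 1e9
```

Under this conversion, association constants of 10⁷, 10⁸ and 10⁹ M⁻¹ correspond to dissociation constants of roughly 100 nM, 10 nM and 1 nM, respectively.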

C. Recombinant Bispecific Antibodies 

The methods discussed above for forming bispecific antibodies from 
antibodies produced by hybridoma cells can also be applied or adapted to production 
of bispecific antibodies from recombinantly expressed antibodies. For example, 
bispecific antibodies can be produced by fusion of two cell lines respectively 
expressing the component antibodies. Alternatively, the component antibodies can be 
co-expressed in the same cell line. Bispecific antibodies can also be formed by 
chemical cross-linking of component recombinant antibodies. 

Component recombinant antibodies can also be linked genetically. In one 
approach, a bispecific antibody is expressed as a single fusion protein comprising the 
four different variable domains from the two component antibodies separated by 
spacers. For example, such a protein might comprise, from one terminus to the other, 
the VL region of the first component antibody, a spacer, the VH domain of the first 
component antibody, a second spacer, the VH domain of the second component 
antibody, a third spacer, and the VL domain of the second component antibody. See, 
e.g., Segal et al., 1992, Biologic Therapy of Cancer Updates 2: 1-12. 

In a further approach, bispecific antibodies are formed by linking component 
antibodies to leucine zipper peptides. See generally Kostelny et al., 1992, J. 
Immunol. 148: 1547-1553. Leucine zippers have the general structural formula 
(Leucine-X1-X2-X3-X4-X5-X6)n, where X may be any of the conventional 20 amino 
acids (Proteins, Structures and Molecular Principles, (1984) Creighton (ed.), 
W. H. Freeman and Company, New York), but are most likely to be amino acids with 
high α-helix forming potential, for example, alanine, valine, aspartic acid, glutamic 
acid, and lysine (Richardson et al., 1988, Science 240: 1648), and n may be 3 or 
greater, although typically n is 4 or 5. 
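
The heptad formula above lends itself to a simple pattern search. The sketch below uses a regular expression to find runs of at least n consecutive (Leu-X6) heptads in a protein sequence; the function name and the example sequence are hypothetical, and a pattern match of this kind does not by itself establish that a region forms a functional zipper.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 conventional residues


def find_leucine_heptads(sequence: str, min_repeats: int = 3) -> list:
    """Return (start, end, repeats) for each run of >= min_repeats Leu-X6 heptads."""
    unit = f"L[{AMINO_ACIDS}]{{6}}"                      # one heptad: Leu + 6 residues
    pattern = re.compile(f"(?:{unit}){{{min_repeats},}}")  # at least min_repeats heptads
    hits = []
    for match in pattern.finditer(sequence.upper()):
        repeats = (match.end() - match.start()) // 7
        hits.append((match.start(), match.end(), repeats))
    return hits


# Hypothetical example: four heptads flanked by glycines.
example = "GG" + "LEAKVAE" * 4 + "GG"
hits = find_leucine_heptads(example)
```

Here `hits` would report one run of four heptads starting at index 2, consistent with the typical n of 4 or 5 noted above.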

In the formation of bispecific antibodies, binding fragments of the component 
antibodies are fused in-frame to first and second leucine zippers. Suitable binding 
fragments include Fv, Fab, Fab', or the heavy chain. The zippers can be linked to 
the heavy or light chain of the antibody binding fragment and are usually linked to the 
C-terminal end. If a constant region or a portion of a constant region is present, the 
leucine zipper is preferably linked to the constant region or portion thereof. For 
example, in a Fab'-leucine zipper fusion, the zipper is usually fused to the C-terminal 
end of the hinge. The inclusion of leucine zippers fused to the respective component 
antibody fragments promotes formation of heterodimeric fragments by annealing of 
the zippers. When the component antibodies include portions of constant regions 
(e.g., Fab' fragments), the annealing of zippers also serves to bring the constant 
regions into proximity, thereby promoting bonding of constant regions (e.g., in a 
F(ab')2 fragment). Typical human constant regions bond by the formation of two 
disulfide bonds between hinge regions of the respective chains. This bonding can be 
strengthened by engineering additional cysteine residue(s) into the respective hinge 
regions, which allows formation of additional disulfide bonds. 

Leucine zippers linked to antibody binding fragments can be produced in 
various ways. For example, polynucleotide sequences encoding a fusion protein 
comprising a leucine zipper can be expressed by a cellular host or by using an in vitro 
translation system. Alternatively, leucine zippers and/or antibody binding fragments 
can be produced separately, either by chemical peptide synthesis, by expression of 
polynucleotide sequences encoding the desired polypeptides, or by cleavage from 
other proteins containing leucine zippers, antibodies, or macromolecular species, and 
subsequent purification. Such purified polypeptides can be linked by peptide bonds, 
with or without intervening spacer amino acid sequences, or by non-peptide covalent 
bonds, with or without intervening spacer molecules, the spacer molecules being 
either amino acids or other non-amino acid chemical structures. Regardless of the 
method or type of linkage, such linkage can be reversible. For example, a chemically 
labile bond, either peptidyl or otherwise, can be cleaved spontaneously or upon 
treatment with heat, electromagnetic radiation, proteases, or chemical agents. Two 
examples of such reversible linkage are: (1) a linkage comprising an Asn-Gly peptide 
bond which can be cleaved by hydroxylamine, and (2) a disulfide bond linkage which 
can be cleaved by reducing agents. 

Component antibody fragment-leucine zipper fusion proteins can be annealed 
by co-expressing both fusion proteins in the same cell line. Alternatively, the fusion 
proteins can be expressed in separate cell lines and mixed in vitro. If the component 
antibody fragments include portions of a constant region (e.g., Fab' fragments), the 
leucine zippers can be cleaved after annealing has occurred. The component 
antibodies remain linked in the bispecific antibody via the constant regions. 

As used herein the term "functionally active antibody fragment" means a 
fragment of an antibody molecule including a region that binds to a protein or 
fragment thereof encoded by a smORF, wherein the antibody fragment retains the 
T-cell stimulating functionality of an intact antibody having the same specificity such as 
the deposited monoclonal antibodies. Such fragments are also well known in the art 
and are regularly employed both in vitro and in vivo. In particular, well-known 
functionally active antibody fragments include but are not limited to F(ab')2, Fab, Fv, 
scFv and Fd fragments of antibodies. These fragments, which lack the Fc fragment of 
an intact antibody, clear more rapidly from the circulation and may have less 
non-specific tissue binding than an intact antibody. For example, single-chain antibodies 
can be constructed in accordance with the methods described in U.S. Pat. No. 
4,946,778 to Ladner et al. Such single-chain antibodies include the variable regions 
of the light and heavy chains joined by a flexible linker moiety. Methods for 
obtaining a single domain antibody ("Fd") which comprises an isolated variable 
heavy chain single domain, also have been reported (see, for example, Ward et al., 
1989, Nature 341: 644-646, disclosing a method of screening to identify an antibody 
heavy chain variable region (VH single domain antibody) with sufficient affinity for 
its target epitope to bind thereto in isolated form). Methods for making recombinant 
Fv fragments based on known antibody heavy chain and light chain variable region 
sequences are known in the art and have been described, e.g., U.S. Pat. No. 
4,462,334. Other references describing the use and generation of antibody fragments 
include, e.g., Fab fragments (Tijssen, PRACTICE AND THEORY OF ENZYME 
IMMUNOASSAYS (Elsevier, Amsterdam, 1985)), Fv fragments (Hochman et al., 1973, 
Biochemistry 12: 1130; Sharon et al., 1976, Biochemistry 15: 1591; Ehrlich et al., U.S. 
Pat. No. 4,355,023) and portions of antibody molecules (e.g., Auditore-Hargreaves, 
U.S. Pat. No. 4,470,925). 

Functionally active antibody fragments also encompass "humanized antibody 
fragments." As one skilled in the art will recognize, such fragments could be 
prepared by traditional enzymatic cleavage of intact humanized antibodies. If, 
however, intact antibodies are not susceptible to such cleavage, because of the nature 
of the construction involved, the noted constructions can be prepared with 
immunoglobulin fragments used as the starting materials; or, if recombinant 
techniques are used, the DNA sequences, themselves, can be tailored to encode the 
desired "fragment" which, when expressed, can be combined in vivo or in vitro, by 
chemical or biological means, to prepare the final desired intact immunoglobulin 
fragment. 

Smaller antibody fragments and small binding polypeptides having binding 
specificity are also contemplated. Several routine assays may be used to easily 
identify such peptides. Screening assays for identifying peptides of the invention are 
performed, for example, using phage display procedures such as those described in 
Hart et al., 1994, J. Biol. Chem. 269: 12468. In general, phage display libraries using, 
e.g., M13 or fd phage, are prepared using conventional procedures such as those 
described in the foregoing reference. The libraries display inserts containing from 4 
to 80 amino acid residues. The inserts optionally represent a completely degenerate 
or a biased array of peptides. Ligands that bind selectively to a smORF polypeptide 
are obtained by selecting those phages which express on their surface a ligand that 
binds to the smORF polypeptide. These phages then are subjected to several cycles of 
reselection to identify the peptide ligand-expressing phages that have the most useful 
binding characteristics. Typically, phages that exhibit the best binding characteristics 
(e.g., highest affinity) are further characterized by nucleic acid analysis to identify the 
particular amino acid sequences of the peptides expressed on the phage surface and 
the optimum length of the expressed peptide to achieve optimum binding to the 
protein or polypeptide fragment encoded by a smORF. Alternatively, such peptide 
ligands can be selected from combinatorial libraries of peptides containing one or 
more amino acids. Such libraries can further be synthesized which contain 
non-peptide synthetic moieties, which are less subject to enzymatic degradation compared 
to their naturally occurring counterparts. 

Additionally, small polypeptides including those containing the smORF 
polypeptide binding CDR3 region may easily be synthesized or produced by 
recombinant means to produce the peptide of the invention. Such methods are well 
known to those of ordinary skill in the art. Peptides can be synthesized, for example, 
using automated peptide synthesizers, which are commercially available. The 
peptides can be produced by recombinant techniques by incorporating the DNA 
expressing the peptide into an expression vector and transforming cells with the 
expression vector to produce the peptide. 

The sequence of the CDR regions, for use in synthesizing the peptides of the 
invention, may be determined by methods known in the art. The heavy chain variable 
region is a peptide which generally ranges from 100 to 150 amino acids in length (or 
any number in between). The light chain variable region is a peptide which generally 
ranges from 80 to 130 amino acids in length (or any number in between). The CDR 
sequences within the heavy and light chain variable regions, which include only 
approximately 3-25 amino acids (including any number in between), may 
easily be sequenced by one of ordinary skill in the art. The peptides may even be 
synthesized by commercial sources such as the Scripps Protein and 
Nucleic Acids Core Sequencing Facility (La Jolla Calif.). 

To determine whether a peptide binds to a smORF polypeptide, any known 
binding assay may be employed. For example, the peptide may be immobilized on a 
surface and then contacted with a labeled smORF polypeptide. The amount of 
smORF polypeptide that interacts with the peptide or the amount that does not bind to 
the peptide may then be quantitated to determine whether the peptide binds to the 
smORF polypeptide. A surface having the deposited monoclonal antibody 
immobilized thereto may serve as a positive control. 

Screening of peptides of the invention also can be carried out utilizing a 
competition assay. If the peptide being tested competes with the deposited 
monoclonal antibody, as shown by a decrease in binding of the deposited monoclonal 
antibody, then it is likely that the peptide and the deposited monoclonal antibody bind 
to the same, or a closely related, epitope. Still another way to determine whether a 
peptide has the specificity of, for example, a monoclonal antibody, is to pre-incubate 
the deposited monoclonal antibody with the smORF polypeptide with which it is 
normally reactive, and then add the peptide being tested to determine if the peptide 
being tested is inhibited in its ability to bind to the smORF polypeptide. If the peptide 
being tested is inhibited then, in all likelihood, it has the same, or a functionally 
equivalent, epitope and specificity as the deposited monoclonal antibody. Other 
methods and assays would be evident to the artisan of ordinary skill. 
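
The decrease in binding observed in a competition assay is commonly summarized as a percent inhibition. The sketch below shows that arithmetic under the assumption that background-corrected binding signals are available with and without the competing peptide; the function and parameter names are hypothetical.

```python
def percent_inhibition(signal_without_competitor: float,
                       signal_with_competitor: float) -> float:
    """Percent decrease in antibody binding caused by the competing peptide."""
    if signal_without_competitor <= 0:
        raise ValueError("control signal must be positive")
    return 100.0 * (1.0 - signal_with_competitor / signal_without_competitor)
```

A value near 0 would suggest no competition, while a value approaching 100 would be consistent with the peptide and the antibody binding the same or a closely related epitope. Any threshold for calling a peptide a competitor would have to be set empirically.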

D. Therapeutic Methods 

Pharmaceutical compositions comprising bispecific antibodies of the present 
invention are useful for parenteral administration, i.e., subcutaneously (s.c.), 
intramuscularly (I.M.) and particularly, intravenously (I.V.). Other contemplated 
forms of administration, depending on the particular need, would be oral, intrathecal, 
and intraperitoneal. The compositions for parenteral administration commonly 
comprise a solution of the antibody or a cocktail thereof dissolved in an acceptable 
carrier, preferably an aqueous carrier. A variety of aqueous carriers can be used, e.g., 
water, buffered water, 0.4% saline, 0.3% glycine and the like. These solutions are 
sterile and generally free of particulate matter. The compositions may contain 
pharmaceutically acceptable auxiliary substances as required to approximate 
physiological conditions such as pH adjusting and buffering agents, toxicity adjusting 
agents and the like, for example sodium acetate, sodium chloride, potassium chloride, 
calcium chloride, sodium lactate. The concentration of the bispecific antibodies in 
these formulations can vary widely, i.e., from less than about 0.01%, usually at least 
about 0.1% to as much as 5% by weight and will be selected primarily based on fluid 
volumes, and viscosities in accordance with the particular mode of administration 
selected. 

A typical antibody or antibody fragment composition for intravenous infusion 
can be made up to contain, for example, 250 ml of sterile Ringer's solution, and 10 
mg of bispecific antibody. See REMINGTON'S PHARMACEUTICAL SCIENCE (15th Ed., 
Mack Publishing Company, Easton, Pa., 1980). 
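
The concentration implied by this example can be worked out directly. The sketch below computes the weight/volume percentage (grams per 100 ml); treating this as comparable to a percentage by weight assumes a dilute aqueous solution with a density of about 1 g/ml. The function names are hypothetical.

```python
def concentration_mg_per_ml(mass_mg: float, volume_ml: float) -> float:
    """Antibody concentration in mg/ml."""
    if volume_ml <= 0:
        raise ValueError("volume must be positive")
    return mass_mg / volume_ml


def percent_w_v(mass_mg: float, volume_ml: float) -> float:
    """Weight/volume percent, i.e. grams of antibody per 100 ml of solution."""
    return concentration_mg_per_ml(mass_mg, volume_ml) / 1000.0 * 100.0
```

For 10 mg of antibody in 250 ml, this gives 0.04 mg/ml, or about 0.004% w/v, which falls at the low end ("less than about 0.01%") of the concentration range stated for these formulations.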

The compositions containing the antibodies or a cocktail 
thereof can be administered for prophylactic and/or therapeutic treatments. In 
therapeutic application, compositions are administered to a subject with a fungal 
infection, which expresses a smORF polypeptide of interest. The amount 
administered to the patient is sufficient to cure or ameliorate the infection or 
corresponding condition caused by the fungus. An amount adequate to accomplish 
this is defined as a "therapeutically effective dose." Amounts effective for use with 
antibodies or antibody fragments will depend upon the severity of the condition and 
the general state of the subject, but generally range from about 0.01 to about 100 mg 
of antibody per dose, with dosages of from 0.1 to 50 mg and 1 to 10 mg per patient 
being more commonly used. Single or multiple administrations on a daily, weekly or 
monthly schedule can be carried out with dose levels and pattern being selected by the 
treating physician. 

In prophylactic applications, compositions containing the antibodies, 
fragments or peptides which bind to smORF polypeptides or a cocktail thereof are 
administered to a patient who is at risk of developing the disease state to enhance the 
patient's resistance. Such an amount is defined to be a "prophylactically effective 
dose." In this use, the precise amounts again depend upon the subject's state of health 
and general level of immunity, but generally range from 0.1 to 100 mg per dose, 
especially 1 to 10 mg per patient. 

E. Diagnostic Methods 

The antibodies and antibody fragments and peptides that bind to smORF 
polypeptides can also be useful in diagnostic methods for diagnosing fungal 
infections. Methods of diagnosis can be performed in vitro using a cellular sample 
(e.g., blood sample, lymph node biopsy or tissue) from a patient and performing a 
histological analysis of the sample, or can be performed by in vivo imaging. These 
methods are readily known in the art. 

While the present invention has been described with specificity in accordance 
with certain of its preferred embodiments, the examples discussed herein serve only 
to illustrate the invention and are not intended to limit the same. 

F. Vaccines 

For smORFs identified using the methods described herein, the proteins 
encoded by these smORFs may be determined to be useful for the preparation of 
vaccines. Typically, proteins, or antigenic fragments thereof, are chosen based on 
their exposure on the surface of a virus, cell or organism, thus exposing them to the 
immune cells of a host. Additionally, these proteins and protein fragments must be 
antigenic or immunogenic (i.e., have the ability to act as an antigen, which 
elicits a specific immune response when introduced into a host). 

The pharmaceutical compositions for use in obtaining an immune response 
would contain such pharmaceutical excipients, adjuvants and/or carriers as are 
standard in preparations designed to obtain an immune response. The therapeutic 
response would be one wherein the subject to which the pharmaceutical composition 
was administered would have a protective effect (i.e., preventing the subject from 
contracting an infection due to the microorganism for which the subject had been 
treated). 

(i) Selection of Immunogen. Vaccines against fungal organisms are 
important to the treatment of a variety of diseases and conditions. For example, 
Cryptococcus neoformans is an opportunistic fungal pathogen which causes an 
incurable, life-threatening meningoencephalitis in patient populations with AIDS. 
Coccidioidomycosis is another emerging health problem in light of the increasing 
numbers of immunosuppressed patients. Most infections are caused by Coccidioides 
immitis, which can advance into coccidioidal pneumonia or extrapulmonary infection. 
Thus, vaccines against these and other fungi are becoming more important, 
especially with increasing numbers of immune compromised individuals. 

Selection of immunogen can be based on one or more factors such as (1) cell
surface exposure and availability of the protein to a host immune cell, (2) predicted
antigenicity/immunogenicity of the immunogen, (3) whether the immunogen may be
N- or O-linked glycosylated, and (4) whether the immunogen is an extracellular protein (e.g., proteinases,
esterases and lipases). Certain glycosylated proteins have served as good antigens in
raising an immune response in animals, such as MP98 of Cryptococcus neoformans in
mice (Levitz et al., Proc. Natl. Acad. Sci. USA 98: 10422-27, 2001) and the MP65
mannoprotein of Candida albicans (Antonio, Nippon Ishinkin Gakkai Zasshi 41: 219,
2000). In addition, the cryptococcal capsular glucuronoxylomannan protected against systemic
mycosis in mice (Devi, Vaccine 14: 1298, 1996). Heat shock proteins have also been
identified as suitable candidates for antifungal vaccines (Deepe et al., J. Immunol.
167: 2219-26, 2001).
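
The four selection factors listed above lend themselves to a simple screening step. The following is a minimal sketch only, assuming each candidate smORF protein has already been annotated with yes/no answers for the four factors; the class name, field names, and the equal weighting of the factors are hypothetical and are not taken from this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical annotation record for one smORF-encoded protein."""
    name: str
    surface_exposed: bool      # factor (1): exposed and available to host immune cells
    predicted_antigenic: bool  # factor (2): predicted antigenicity/immunogenicity
    glycosylated: bool         # factor (3): may be N- or O-linked glycosylated
    extracellular: bool        # factor (4): extracellular protein (e.g., proteinase)

    def criteria_met(self) -> int:
        # Count how many of the four factors are satisfied (equal weights assumed).
        return sum((self.surface_exposed, self.predicted_antigenic,
                    self.glycosylated, self.extracellular))


def rank_candidates(candidates, min_criteria=1):
    """Keep candidates meeting at least `min_criteria` factors, best first."""
    kept = [c for c in candidates if c.criteria_met() >= min_criteria]
    # sorted() is stable, so ties keep their input order.
    return sorted(kept, key=lambda c: c.criteria_met(), reverse=True)


if __name__ == "__main__":
    demo = [
        Candidate("smORF_A", True, False, False, False),
        Candidate("smORF_B", True, True, True, True),
        Candidate("smORF_C", False, False, False, False),
    ]
    for c in rank_candidates(demo):
        print(c.name, c.criteria_met())
```

Because selection "can be based on one or more factors", the default threshold is one factor; a stricter screen would raise `min_criteria`.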

(ii) Polypeptide and DNA Vaccines. Antifungal vaccines can be prepared 
in a variety of ways. For purposes of this invention, living and non-living (i.e., 
derived from the entire microorganism) fungal vaccines are less preferred. More 
preferred are vaccine formulations that can be administered as (1) polypeptides, (2) 
polypeptides conjugated to another antigenic compound, or (3) direct inoculation of
plasmid DNA encoding the desired smORF, wherein expression is driven by a strong 
promoter capable of efficient activity in a variety of mammalian cell types. 

Once suitable immunogens are identified, protein-based vaccines can be prepared
wherein one or more smORF polypeptides (20-500 µg polypeptide, more preferably
about 50-150 µg) are mixed with a pharmaceutically acceptable adjuvant. If testing
in animals, an injection is administered to the animal, followed by second and third
injections a few weeks later. For example, 100 µg of polypeptide (or combination of
polypeptides) is admixed with a desired adjuvant (e.g., Ribi adjuvant, RIBI
ImmunoChem Research Inc.). The material can be injected intramuscularly or 
subcutaneously in an animal subject. In mice, the protectiveness of the vaccine can 
be measured by footpad hypersensitivity testing. For instance, the peptide is prepared 
and injected into the hind footpads of the mice with either 50 of spherule-phase 
smORF polypeptide diluted in non-pyrogenic saline or in saline alone. Footpad 
thickness is then measured with a dual caliper and the results calculated as the 
difference in footpad thickness of antigen- and saline-injected pads at 18 to 25 hours 
minus the difference in footpad thickness of antigen- and saline-injected pads before
challenge. Lack of footpad sensitivity indicates that the mice have received some 
protective immunity with the injected antigen. 
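
The footpad calculation described above can be written out as a short sketch. The function name, argument names, and the use of millimeters are illustrative assumptions; the specification gives only the arithmetic (the antigen-minus-saline difference at 18 to 25 hours, minus the same difference before challenge).

```python
def footpad_swelling(antigen_before, saline_before, antigen_after, saline_after):
    """Net footpad response, in the same units as the inputs (assumed mm).

    'before' values are thicknesses measured before challenge; 'after'
    values are measured 18 to 25 hours after injection.
    """
    baseline_difference = antigen_before - saline_before
    post_injection_difference = antigen_after - saline_after
    return post_injection_difference - baseline_difference


if __name__ == "__main__":
    # Hypothetical caliper readings for one mouse, in mm.
    net = footpad_swelling(antigen_before=2.0, saline_before=2.0,
                           antigen_after=2.6, saline_after=2.1)
    print(f"net footpad response: {net:.2f} mm")
```

Subtracting the pre-challenge difference corrects for any pre-existing asymmetry between the two hind footpads of the same animal.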

Additional methods for preparing, using and assaying pharmaceutical 
compositions for inducing a protective immune response can be performed according 
to what is known in the art. See, for example, S.H.E. Kaufmann, Concepts in Vaccine
Development (Walter De Gruyter 1996); Devi, Vaccine 14: 841-4 (1996); Deepe et 
al., J. Immunol. 167: 2219-26 (2001) and Levitz et al., Proc. Natl. Acad. Sci. USA 98: 
10422-27 (2001). 

For purposes of conferring immunogenicity using a DNA vaccine, a plasmid
containing, and operably linked to, the desired smORF would be administered, for
example as follows. The desired smORF would be operably linked into a plasmid,
such as pGEX-4-T3 (Pharmacia Biotech, Piscataway, NJ), downstream from the gene
encoding glutathione S-transferase (GST). The smORF-containing plasmid is then
amplified and preferably purified. The plasmid can then be used to immunize mice or
another suitable animal. If using mice (for example, in an assay system), the mice are
injected with 200 µl of the smORF-containing plasmid (100 µg) or the plasmid alone
(100 µg). The plasmid is in a mixture with saline and admixed with an equal volume
of Ribi adjuvant (RIBI ImmunoChem Research, Inc.) or another adjuvant suitable for DNA
vaccines. Additional components may be present, such as synthetic trehalose
dicorynomycolate (TDM) and cell wall skeleton. The DNA-containing composition
is typically administered intramuscularly or subcutaneously. Second or third injections
can also be given via intramuscular or subcutaneous routes. The plasmid can also be 
administered intraperitoneally (i.p.). See, e.g., Jiang et al., "Genetic Vaccination 
against Coccidioides immitis: Comparison of Vaccine Efficacy of Recombinant 
Antigen 2 and Antigen 2 cDNA," Infection & Immun. 67: 630-5 (1999). 
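
The dose figures in the mouse example above imply a simple concentration calculation, sketched here. This is an illustration under a stated assumption: the 200 µl is read as the final injected volume, made up of equal parts plasmid-in-saline and adjuvant. The text could also be read differently, and the function name and dictionary keys are hypothetical.

```python
def injection_mix(plasmid_ug=100.0, injection_volume_ul=200.0):
    """Split one injection into its saline and adjuvant parts.

    Assumption (not stated in the source): the injected volume is the
    final 1:1 mixture of plasmid-in-saline and adjuvant.
    """
    saline_part_ul = injection_volume_ul / 2  # plasmid dissolved in saline
    adjuvant_ul = injection_volume_ul / 2     # equal volume of adjuvant
    return {
        "saline_part_ul": saline_part_ul,
        "adjuvant_ul": adjuvant_ul,
        "plasmid_in_saline_ug_per_ul": plasmid_ug / saline_part_ul,
        "final_ug_per_ul": plasmid_ug / injection_volume_ul,
    }


if __name__ == "__main__":
    for key, value in injection_mix().items():
        print(f"{key}: {value}")
```

Under this reading, the control animals receiving the plasmid alone would get the same 100 µg of DNA in the same volume, so the two groups differ only in the smORF insert.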

In vivo assays of animals, such as mice, can be performed to determine the 
protectiveness of a particular smORF or smORFs or antigenic fragments thereof. 
Once animals have been injected with the smORF DNA, as discussed above, the 
animals can be challenged with exposure to the particular microorganism. Typically,
challenge is by intraperitoneal injection of the microorganism into the animal, followed by
assessment of the survival of the vaccinated mice as compared to control animals.
See, e.g., Jiang et al., "Genetic Vaccination against Coccidioides immitis: 
Comparison of Vaccine Efficacy of Recombinant Antigen 2 and Antigen 2 cDNA," 
Infection & Immun. 67: 630-5 (1999). Additional methods of preparing, 
administering, and assaying such compositions would be apparent to the artisan. See,
for example, "Development and Clinical Progress of DNA Vaccines: Paul-Ehrlich-
Institut" in Developments in Biologicals vol. 104 (F. Brown et al., eds., S. Karger
Publ., 2000); "DNA Vaccines: Methods and Protocols" in Methods in Molecular
Medicine vol. 29 (Douglas B. Lowrie and Robert G. Whalen eds., Humana Press,
2000); Yvonne Paterson, Intracellular Bacterial Vaccine Vectors: Immunology, Cell
Biology, and Genetics (Wiley-Liss, 1999); Bruce H. Nicholson, Synthetic Vaccines
(Blackwell Science Inc. 1994); and Richard E. Isaacson, Recombinant DNA Vaccines
(Marcel Dekker, 1992). 
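
The survival comparison between vaccinated and control animals in the challenge assay above can be summarized numerically. The sketch below is an assumption-laden illustration: the specification does not name a statistical test, so a one-sided Fisher's exact test is swapped in here as one common choice for small groups, and all function names and group sizes are hypothetical.

```python
from math import comb


def survival_fraction(survivors, group_size):
    """Fraction of animals in a group that survived the challenge."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return survivors / group_size


def fisher_one_sided(vacc_surv, vacc_total, ctrl_surv, ctrl_total):
    """P-value for seeing at least `vacc_surv` vaccinated survivors by chance.

    Uses the hypergeometric distribution with all margins fixed
    (one-sided Fisher's exact test; chosen for illustration only).
    """
    total = vacc_total + ctrl_total
    survivors = vacc_surv + ctrl_surv
    deaths = total - survivors
    upper = min(survivors, vacc_total)
    ways = sum(comb(survivors, x) * comb(deaths, vacc_total - x)
               for x in range(vacc_surv, upper + 1))
    return ways / comb(total, vacc_total)


if __name__ == "__main__":
    # Hypothetical outcome: 9 of 10 vaccinated and 2 of 10 control mice survive.
    print(survival_fraction(9, 10), survival_fraction(2, 10))
    print(fisher_one_sided(9, 10, 2, 10))
```

A small p-value here would suggest that the higher survival in the vaccinated group is unlikely to be due to chance alone; it does not by itself establish the degree of protection.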

All references discussed above are herein incorporated by reference in their
entirety. 